Here we can type long passages or descriptions of our data without the need of “hashtaging” our comments with the # symbol. In our first example we will be using the ToothGrowth dataset. In this experiemnt Guinea pigs were given different amounts of vitaimin C to see the affects on the animals teeth growth
To run R code in a markdown file, we need to denote the section that is considered R code. We call these “code chunks”
Below is a code chunk:
Toothdata <- ToothGrowth
head(Toothdata)
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
As you can see, from running the “play” button results in the code chunk printed inline of the r markdown file.
fit <- lm(len ~ dose, data = Toothdata)
b <- fit$coefficients
plot(len ~ dose, data = Toothdata)
abline(lm(len ~ dose, data = Toothdata))
The Tooth Growth of Guniea Pigs When Given Variable Ammounts of Vitamins
the slope of the regression line is 9.7635714,
we can also put sections and subsections in our r markdown file, similar to numbers or bullet points in a word document. This is done with the “#” that we previously used to denote text in an R script.
make sure you put a space after the hashtag, otherwise it will not work!
we can also add bullet points-type marks in our r markdown file.
its important to note here that r markdown indentation matters!
we can put really nice quootes into the markdown document. We do this by using the “>” symbool.
“Genes are like the story, and DNA is the language that the story is written in.”
— Sam Kean
“If you ain’t first you’re last”
— Ricky Bobby
Hyperlinks can also be incorperated into these files. This is especially useful in HTML files, since they are in a web browser anf will redirect the reader to the material that you are interested in showing the. Here we will use the link to R markdown’s homepage for this example. RMarkdown
We can also put nice formated formulas into Markdown using two dollar signs.
Hard-Weinberg Formula
\[p^2 + 2pq + q^2 = 1\] and you can get really complex as well
\[\Theta = \begin{pmatrix}\alpha & \beta\\ \gamma & \delta \end{pmatrix}\]
There are alsso options for your R markdown file on how Knitr interprits the code chunk. There are the following options.
Eval (T or F): whether or not to evaluate the code chunk
Echo (T or F): whether or not to show the code for the chunk, but results will still print.
cache: if you enable, the same code chunk will not be evaluated the next time that the knitr is run. Great for code that has LONG run times.
fig.width or fig.height: the (graphical device) size of the R plots in inches. The figures are first written to the knitr document then to files that are saved separately.
out.width or out.height: The output size of the R plots IN THE DOCUMENT.
fig.cap: the words for the figure caption
we can also add a table of contents to out HTML Document. We do this by altering the YAML code ( the weird code chunk at the very top of the document.) We can add this: title: “Final Project BISC 516” author: “Megan Barr” date: “2024-11-11” output: html_document: toc: true toc_float: true
This will give us a floating table of contents on the right hand side of the document.
You can also add TABS in our report. To do this you need to specify each section that you want to become a tab by placing “{.tabset}” after the line. Every subsequent header will be a new tab.
you can also add themes to your HTML document that change the highlighting color and hyperlink color of your HTML output. This can be nice aestheticlly To do this, you can change yoour theme in te YAML to one of the following.
You can alsso change the color by specifying highlights {.tabset}
Load libraries and data
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(nycflights13)
my_data <- nycflights13::flights
head(my_data)
## # A tibble: 6 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
first we will just look at the data on October 14th.
## # A tibble: 987 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 10 14 451 500 -9 624 648
## 2 2013 10 14 511 517 -6 733 757
## 3 2013 10 14 536 545 -9 814 855
## 4 2013 10 14 540 545 -5 932 933
## 5 2013 10 14 548 545 3 824 827
## 6 2013 10 14 549 600 -11 719 730
## 7 2013 10 14 552 600 -8 650 659
## 8 2013 10 14 553 600 -7 646 700
## 9 2013 10 14 554 600 -6 836 829
## 10 2013 10 14 555 600 -5 832 855
## # ℹ 977 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
If we want to subset this into a new variable, we do the following:
oct_14_flight <- filter(my_data, month == 10, day ==14)
what if you want to do both print and save the variable?
(oct_14_flight_2 <- filter(my_data, month == 10, day ==14))
## # A tibble: 987 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 10 14 451 500 -9 624 648
## 2 2013 10 14 511 517 -6 733 757
## 3 2013 10 14 536 545 -9 814 855
## 4 2013 10 14 540 545 -5 932 933
## 5 2013 10 14 548 545 3 824 827
## 6 2013 10 14 549 600 -11 719 730
## 7 2013 10 14 552 600 -8 650 659
## 8 2013 10 14 553 600 -7 646 700
## 9 2013 10 14 554 600 -6 836 829
## 10 2013 10 14 555 600 -5 832 855
## # ℹ 977 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
(flight_through_september <- filter(my_data, month < 10))
## # A tibble: 252,484 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 252,474 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
If we don’t use the == to mean equals, we get this:
(oct_14_flight_2 <- filter(my_data, month = 10, day =14))
Let’s use the “or” functioon to pick flights in march and april
March_April_Flights <- filter(my_data, month == 3 | month ==4)
March_4th_Flights <- filter(my_data, month == 3 & day ==4)
Non_jan_flights <- filter(my_data, month != 1)
Arrange allows us to arrange the dataset based on the variables we desire.
arrange(my_data, year, day, month)
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
We can also do this in descending fashion
descending <- arrange(my_data, desc(year), desc(day), desc(month))
misssing values are always placed at the end of the dataframe regardless of ascending or descending.
We can also select specfic columns that we wnant to looka at.
calander <- select(my_data, year, month, day)
print(calander)
## # A tibble: 336,776 × 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 1
## 3 2013 1 1
## 4 2013 1 1
## 5 2013 1 1
## 6 2013 1 1
## 7 2013 1 1
## 8 2013 1 1
## 9 2013 1 1
## 10 2013 1 1
## # ℹ 336,766 more rows
We can also look at a range of columns
calender2 <- select(my_data, year:day)
Let’s look at all columns months through carrier
calender3 <- select(my_data, year:carrier)
We can also chooe which NOT to include
everything_else <- select(my_data, -(year:day))
In this instance we can also use the “not” opperator !
everything_else_2 <- select(my_data, !(year:day))
head(my_data)
## # A tibble: 6 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
rename(my_data, departure_time = dep_time)
## # A tibble: 336,776 × 19
## year month day departure_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ℹ 336,766 more rows
## # ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
my_data <- rename(my_data, departure_time = dep_time)
What if you want to add new columns to your data frame? we have the mutate() function for that.
First, lets make smaller data frame so we can see what we’re doing.
my_data_small <- select(my_data, year:day, distance, air_time)
Lets calculate the speed of the flights.
mutate(my_data_small, speed = distance / air_time * 60)
## # A tibble: 336,776 × 6
## year month day distance air_time speed
## <int> <int> <int> <dbl> <dbl> <dbl>
## 1 2013 1 1 1400 227 370.
## 2 2013 1 1 1416 227 374.
## 3 2013 1 1 1089 160 408.
## 4 2013 1 1 1576 183 517.
## 5 2013 1 1 762 116 394.
## 6 2013 1 1 719 150 288.
## 7 2013 1 1 1065 158 404.
## 8 2013 1 1 229 53 259.
## 9 2013 1 1 944 140 405.
## 10 2013 1 1 733 138 319.
## # ℹ 336,766 more rows
my_data_small <- mutate(my_data_small, speed = distance / air_time * 60)
what if we wanted to create a new data frame with ONLY your calculations (transmute)
airspeed <- transmute(my_data_small, speed = distance / air_time * 60 , speed2 = distance / air_time)
we can use summarize to run a function on a data column to get a single return
summarize(my_data, delay = mean(dep_delay, na.rm = T))
## # A tibble: 1 × 1
## delay
## <dbl>
## 1 12.6
So we can see here the average delay is about 12 minutes
we gain additional value in summerize by pairing it with by_group()
by_day <- group_by(my_data, year, month, day)
summarize(by_day, delay = mean(dep_delay, na.rm = T))
## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.
## # A tibble: 365 × 4
## # Groups: year, month [12]
## year month day delay
## <int> <int> <int> <dbl>
## 1 2013 1 1 11.5
## 2 2013 1 2 13.9
## 3 2013 1 3 11.0
## 4 2013 1 4 8.95
## 5 2013 1 5 5.73
## 6 2013 1 6 7.15
## 7 2013 1 7 5.42
## 8 2013 1 8 2.55
## 9 2013 1 9 2.28
## 10 2013 1 10 2.84
## # ℹ 355 more rows
As you can see, we now have the delay by the days of the year
What happens if we dont tell R what to do with the missing data?
summarize(by_day, delay = mean(dep_delay))
## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.
## # A tibble: 365 × 4
## # Groups: year, month [12]
## year month day delay
## <int> <int> <int> <dbl>
## 1 2013 1 1 NA
## 2 2013 1 2 NA
## 3 2013 1 3 NA
## 4 2013 1 4 NA
## 5 2013 1 5 NA
## 6 2013 1 6 NA
## 7 2013 1 7 NA
## 8 2013 1 8 NA
## 9 2013 1 9 NA
## 10 2013 1 10 NA
## # ℹ 355 more rows
We can also filter our data based on NA (which in this dataset was cancelled flights)
not_cancalled <- filter(my_data, !is.na(dep_delay), !is.na(arr_delay))
Lets run summarize again on this data
summarize(not_cancalled, delay = mean(dep_delay))
## # A tibble: 1 × 1
## delay
## <dbl>
## 1 12.6
We can also count the number of variables that are NA
sum(is.na(my_data$dep_delay))
## [1] 8255
We can also count the number that are NOT NA
sum(!is.na(my_data$dep_delay))
## [1] 328521
With tibble data sets (more on them soon), we can pipe results to get rid of the dollar signs We can use them to summarize the number of flights by minutes delayed.
my_data %>%
group_by(year, month, day) %>%
summarize(mean = mean(departure_time, na.rm = T))
## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.
## # A tibble: 365 × 4
## # Groups: year, month [12]
## year month day mean
## <int> <int> <int> <dbl>
## 1 2013 1 1 1385.
## 2 2013 1 2 1354.
## 3 2013 1 3 1357.
## 4 2013 1 4 1348.
## 5 2013 1 5 1326.
## 6 2013 1 6 1399.
## 7 2013 1 7 1341.
## 8 2013 1 8 1335.
## 9 2013 1 9 1326.
## 10 2013 1 10 1333.
## # ℹ 355 more rows
load library
library(tibble)
Now we will take the time to explore tibbles. Tibbles are modified data frames which tweak some of the older Features from data frames. R is an old language, and useful things from 20 years ago are not as useful anymore.
as_tibble(iris)
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # ℹ 140 more rows
as we can see, we have the same data frame, but we have different features
you can also create a tibble from scratch with tibble()
tibble(
x = 1:5,
y = 1,
z = x ^ 2 + y
)
## # A tibble: 5 × 3
## x y z
## <int> <dbl> <dbl>
## 1 1 1 2
## 2 2 1 5
## 3 3 1 10
## 4 4 1 17
## 5 5 1 26
you can also ues tribble() for basic data table creation
tribble(
~geneA, ~ geneB, ~ geneC,
##########################
110, 112, 114,
6, 5, 4
)
## # A tibble: 2 × 3
## geneA geneB geneC
## <dbl> <dbl> <dbl>
## 1 110 112 114
## 2 6 5 4
Tibbles are built to not overwhelm your console when printing data, only showing the first few lines
This is how a data frame prints
Subsetting tibbles is easy, similar to data.frames
df_tibble <- tibble(nycflights13::flights)
df_tibble
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
We can subset by column name using the $
we can subset by position using [[]]
if you want to use this in a pipe, you need to use the “.” placeholder
some older functions do not like tibbles, thus might have to convert them back to dataframes
class(df_tibble)
## [1] "tbl_df" "tbl" "data.frame"
df_tibble_2 <- as.data.frame(df_tibble)
class(df_tibble_2)
## [1] "data.frame"
df_tibble
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
head(df_tibble_2)
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## arr_delay carrier flight tailnum origin dest air_time distance hour minute
## 1 11 UA 1545 N14228 EWR IAH 227 1400 5 15
## 2 20 UA 1714 N24211 LGA IAH 227 1416 5 29
## 3 33 AA 1141 N619AA JFK MIA 160 1089 5 40
## 4 -18 B6 725 N804JB JFK BQN 183 1576 5 45
## 5 -25 DL 461 N668DN LGA ATL 116 762 6 0
## 6 12 UA 1696 N39463 EWR ORD 150 719 5 58
## time_hour
## 1 2013-01-01 05:00:00
## 2 2013-01-01 05:00:00
## 3 2013-01-01 05:00:00
## 4 2013-01-01 05:00:00
## 5 2013-01-01 06:00:00
## 6 2013-01-01 05:00:00
library(tidyverse)
how do we make a tidy dataset? well the tidyversse follows three rules.
Its impossible to satisfy two of the three rules.
This leads to the following instructions
picking one consistent method of data storage makes for easier understanding of your code an what is happening “under the hood” or behinde the scenes.
Lets now look at working with tibbles
library(tidyverse)
bmi <- tibble(women)
bmi %>%
mutate(bmi = (703 * weight)/ (height)^2)
## # A tibble: 15 × 3
## height weight bmi
## <dbl> <dbl> <dbl>
## 1 58 115 24.0
## 2 59 117 23.6
## 3 60 120 23.4
## 4 61 123 23.2
## 5 62 126 23.0
## 6 63 129 22.8
## 7 64 132 22.7
## 8 65 135 22.5
## 9 66 139 22.4
## 10 67 142 22.2
## 11 68 146 22.2
## 12 69 150 22.1
## 13 70 154 22.1
## 14 71 159 22.2
## 15 72 164 22.2
Sometines you’ll find datasets that don’t fit well into a tibble We’ll use the built-in data from tidyverse for this part
table4a
## # A tibble: 3 × 3
## country `1999` `2000`
## <chr> <dbl> <dbl>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
as you can see form this data, we have one variable in column A (country) But columns B and C are two of the same, thus there are two observations in each row
To fix this we can use the gather function
table4a %>%
gather('1999', '2000' , key = 'year' , value ='cases')
## # A tibble: 6 × 3
## country year cases
## <chr> <chr> <dbl>
## 1 Afghanistan 1999 745
## 2 Brazil 1999 37737
## 3 China 1999 212258
## 4 Afghanistan 2000 2666
## 5 Brazil 2000 80488
## 6 China 2000 213766
lets look at another example
table4b
## # A tibble: 3 × 3
## country `1999` `2000`
## <chr> <dbl> <dbl>
## 1 Afghanistan 19987071 20595360
## 2 Brazil 172006362 174504898
## 3 China 1272915272 1280428583
as you can see we have the same problem in table 4b
table4b %>%
gather ("1999", "2000", key = "year", value = "population")
## # A tibble: 6 × 3
## country year population
## <chr> <chr> <dbl>
## 1 Afghanistan 1999 19987071
## 2 Brazil 1999 172006362
## 3 China 1999 1272915272
## 4 Afghanistan 2000 20595360
## 5 Brazil 2000 174504898
## 6 China 2000 1280428583
now what if we want to join these two tables? we can use dplyr
table4a <- table4a %>%
gather('1999', '2000' , key = 'year' , value ='cases')
table4b <- table4b %>%
gather ("1999", "2000", key = "year", value = "population")
left_join(table4a, table4b)
## Joining with `by = join_by(country, year)`
## # A tibble: 6 × 4
## country year cases population
## <chr> <chr> <dbl> <dbl>
## 1 Afghanistan 1999 745 19987071
## 2 Brazil 1999 37737 172006362
## 3 China 1999 212258 1272915272
## 4 Afghanistan 2000 2666 20595360
## 5 Brazil 2000 80488 174504898
## 6 China 2000 213766 1280428583
spreading is the opposite of gathering. Lets look at table 2
table2
## # A tibble: 12 × 4
## country year type count
## <chr> <dbl> <chr> <dbl>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
you can see that we have redundant info in columns 1 and 2 we can fix that by combining rows 1 and 2, 3 and 4, ect.
spread(table2, key = type, value = count)
## # A tibble: 6 × 4
## country year cases population
## <chr> <dbl> <dbl> <dbl>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
Type is the key of what we are turning into columns, the value is what becomes rows/observations
In summary spread makes long tables shorter and wideer gather makes wide tables narrower and longer
Now what happens when we have two observations stuck in one column?
table3
## # A tibble: 6 × 3
## country year rate
## <chr> <dbl> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
As you can see, the rate iis just the population and cases combined. We can use seperate the fix this
table3%>%
separate(rate, into = c("cases", "population"))
## # A tibble: 6 × 4
## country year cases population
## <chr> <dbl> <chr> <chr>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
However, if you notice, the column type is not correct.
table3%>%
separate(rate, into =c("cases", "populate"), conver = TRUE)
## # A tibble: 6 × 4
## country year cases populate
## <chr> <dbl> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
You can specify what you want to seperate based on.
table3%>%
separate(rate, into =c("cases", "populate"), sep = "/", conver = TRUE)
## # A tibble: 6 × 4
## country year cases populate
## <chr> <dbl> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
Lets make this look more tidy
table3 %>%
separate(
year,
into = c("centurey", "year"),
convert = TRUE,
sep = 2
)
## # A tibble: 6 × 4
## country centurey year rate
## <chr> <int> <int> <chr>
## 1 Afghanistan 19 99 745/19987071
## 2 Afghanistan 20 0 2666/20595360
## 3 Brazil 19 99 37737/172006362
## 4 Brazil 20 0 80488/174504898
## 5 China 19 99 212258/1272915272
## 6 China 20 0 213766/1280428583
What happens if we want to do the inverse of separate?
table5
## # A tibble: 6 × 4
## country century year rate
## <chr> <chr> <chr> <chr>
## 1 Afghanistan 19 99 745/19987071
## 2 Afghanistan 20 00 2666/20595360
## 3 Brazil 19 99 37737/172006362
## 4 Brazil 20 00 80488/174504898
## 5 China 19 99 212258/1272915272
## 6 China 20 00 213766/1280428583
table5 %>%
unite(date, century, year, sep = "")
## # A tibble: 6 × 3
## country date rate
## <chr> <chr> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
There can be two types of missing values, NA (explicit) or just no entry (implicit)
gene_data <- tibble(
gene = c('a', 'a', 'a', 'a', 'b', 'b', 'b'),
nuc = c(20, 22, 24, 25, NA, 42, 67),
run = c(1,2,3,4,2,3,4)
)
gene_data
## # A tibble: 7 × 3
## gene nuc run
## <chr> <dbl> <dbl>
## 1 a 20 1
## 2 a 22 2
## 3 a 24 3
## 4 a 25 4
## 5 b NA 2
## 6 b 42 3
## 7 b 67 4
The nucelotide count for Gene B run 2 is explicit missing The nucleotide count for Gene B run 1 is implicitly missing
One way we can make implicit missing values explicit is by putting runs in columns
gene_data %>%
spread(gene, nuc)
## # A tibble: 4 × 3
## run a b
## <dbl> <dbl> <dbl>
## 1 1 20 NA
## 2 2 22 NA
## 3 3 24 42
## 4 4 25 67
If we want to remove the missing values we can use spread and gather and na.rm =TRUE
gene_data %>%
spread(gene, nuc) %>%
gather(gene, nuc, 'a' : 'b', na.rm = TRUE)
## # A tibble: 6 × 3
## run gene nuc
## <dbl> <chr> <dbl>
## 1 1 a 20
## 2 2 a 22
## 3 3 a 24
## 4 4 a 25
## 5 3 b 42
## 6 4 b 67
Another way we can make missing values implicit vs explicit, is complete()
gene_data %>%
complete(gene, run)
## # A tibble: 8 × 3
## gene run nuc
## <chr> <dbl> <dbl>
## 1 a 1 20
## 2 a 2 22
## 3 a 3 24
## 4 a 4 25
## 5 b 1 NA
## 6 b 2 NA
## 7 b 3 42
## 8 b 4 67
Sometimes an NA is present to represent a value being varried forward
treatment <- tribble(
~person, ~treatment, ~response,
####################################################
"issac", 1, 7,
NA, 2, 10,
NA, 3, 9,
"VDB", 1, 8,
NA, 2, 11,
NA, 3, 10
)
treatment
## # A tibble: 6 × 3
## person treatment response
## <chr> <dbl> <dbl>
## 1 issac 1 7
## 2 <NA> 2 10
## 3 <NA> 3 9
## 4 VDB 1 8
## 5 <NA> 2 11
## 6 <NA> 3 10
What we can do here is use the full() option
treatment %>%
fill(person)
## # A tibble: 6 × 3
## person treatment response
## <chr> <dbl> <dbl>
## 1 issac 1 7
## 2 issac 2 10
## 3 issac 3 9
## 4 VDB 1 8
## 5 VDB 2 11
## 6 VDB 3 10
It is rare that you will be working with a simple data table. The DYPLR package allows you to join two data tabels based on common values.
Load libraries
library(tidyverse)
library(nycflights13)
Lets pull full carrier names based on letter codes
airlines
## # A tibble: 16 × 2
## carrier name
## <chr> <chr>
## 1 9E Endeavor Air Inc.
## 2 AA American Airlines Inc.
## 3 AS Alaska Airlines Inc.
## 4 B6 JetBlue Airways
## 5 DL Delta Air Lines Inc.
## 6 EV ExpressJet Airlines Inc.
## 7 F9 Frontier Airlines Inc.
## 8 FL AirTran Airways Corporation
## 9 HA Hawaiian Airlines Inc.
## 10 MQ Envoy Air
## 11 OO SkyWest Airlines Inc.
## 12 UA United Air Lines Inc.
## 13 US US Airways Inc.
## 14 VX Virgin America
## 15 WN Southwest Airlines Co.
## 16 YV Mesa Airlines Inc.
lets get info about airports
Lets get info about each plane
Lets get some info on the weather at the airporots
lets get info on singular flights
Keys are unique identifieers per observation primary key uniquley identifies an observation in its own table
one way to identify a primary key is as follows
planes %>%
count(tailnum) %>%
filter(n>1)
## # A tibble: 0 × 2
## # ℹ 2 variables: tailnum <chr>, n <int>
This indicated that the tailnumber is unique
planes %>%
count(model) %>%
filter(n>1)
## # A tibble: 79 × 2
## model n
## <chr> <int>
## 1 717-200 88
## 2 737-301 2
## 3 737-3G7 2
## 4 737-3H4 105
## 5 737-3K2 2
## 6 737-3L9 2
## 7 737-3Q8 5
## 8 737-3TO 2
## 9 737-401 4
## 10 737-4B7 18
## # ℹ 69 more rows
We have now added tthe airline name to our dataframe from the airline dataframe
other types of joins 1. inner jpins (inner_join) matches a pair of observations when their key is equal 2. outer joins (outer_joins) keeps observations that appear in at least one table.
load libraries
library(tidyverse)
library(stringr)
you can create strings using single or double quotes
string1 <- "This is a string"
string2 <- 'to put a "quote" in your string, ue the opposite'
string1
## [1] "This is a string"
string2
## [1] "to put a \"quote\" in your string, ue the opposite"
if you forget to close your string, you’ll get this: - string3 <- “where iss this string going? need to use paratnthes to close string if it does happen hit escape and try again
string3 <- "where is this string going?"
string3
## [1] "where is this string going?"
multiple strings are stored in character vectors
string4 <- c("one", "two", "three")
string4
## [1] "one" "two" "three"
measuring string length
str_length(string3)
## [1] 27
str_length(string4)
## [1] 3 3 5
lets combine two strings
str_c("x", "y")
## [1] "xy"
str_c(string1, string2)
## [1] "This is a stringto put a \"quote\" in your string, ue the opposite"
you can use sep to control how they are seperated
str_c(string1, string2, ssep = " ")
## [1] "This is a stringto put a \"quote\" in your string, ue the opposite "
str_c("x", "y", "z", sep = "_")
## [1] "x_y_z"
you can ssubset a string using str_sub()
HSP <- c("HSP123", "HSP234", "HSP456")
str_sub(HSP, 4,6)
## [1] "123" "234" "456"
this just drops the first four letters from the strings
or you can use negatives to count back from the end
str_sub(HSP, -3, -1)
## [1] "123" "234" "456"
you can convert the cases of strings like follows:
HSP
## [1] "HSP123" "HSP234" "HSP456"
str_to_lower(HSP)
## [1] "hsp123" "hsp234" "hsp456"
str_to_upper() - makes everything upper case instead of lower case
install.packages("htmlwidgets")
## Installing package into '/home/student/R/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
x <- c('ATTAGA', 'CGCCCCCGGAT', 'TATTA')
str_view(x, "G")
## [1] │ ATTA<G>A
## [2] │ C<G>CCCCC<G><G>AT
str_view(x, "TA")
## [1] │ AT<TA>GA
## [3] │ <TA>T<TA>
the next step is, “.” where the “.” matches an entry
str_view(x, ".G.")
## [1] │ ATT<AGA>
## [2] │ <CGC>CCC<CGG>AT
Anchor allow you to match at the start or the ending
str_view(x, "^TA")
## [3] │ <TA>TTA
str_view(x, "TA$")
## [3] │ TAT<TA>
###character classes/alternatives
str_view(x, "TA[GT]")
## [1] │ AT<TAG>A
## [3] │ <TAT>TA
[^anc] matches anything BUT a, b, or c
str_view(x, "TA[^T]")
## [1] │ AT<TAG>A
you can also use | to pick between two alternatives
str_view(x, "TA[G|T]")
## [1] │ AT<TAG>A
## [3] │ <TAT>TA
str_detect () returns a logical vector the same length of input
y <- c("apple", "bannana", "pear")
y
## [1] "apple" "bannana" "pear"
str_detect(y, "e")
## [1] TRUE FALSE TRUE
How many common words start with letter e
Lets get more complex, what proportion words end in a vowel?
Now lets extract
you can also use str count to say how many matches there are in a string
lets couple this with mutate
df <- tibble(
word = words,
count = seq_along(word)
)
df
## # A tibble: 980 × 2
## word count
## <chr> <int>
## 1 a 1
## 2 able 2
## 3 about 3
## 4 absolute 4
## 5 accept 5
## 6 account 6
## 7 achieve 7
## 8 across 8
## 9 act 9
## 10 active 10
## # ℹ 970 more rows
df%>%
mutate(
vowels = str_count(words, "[aeiou]"),
consstonants = str_count(words, "[^aeiou]")
)
## # A tibble: 980 × 4
## word count vowels consstonants
## <chr> <int> <int> <int>
## 1 a 1 1 0
## 2 able 2 2 2
## 3 about 3 3 2
## 4 absolute 4 4 4
## 5 accept 5 2 4
## 6 account 6 3 4
## 7 achieve 7 4 3
## 8 across 8 2 4
## 9 act 9 1 2
## 10 active 10 3 3
## # ℹ 970 more rows
Load Libraries
library(ath1121501cdf)
## Warning: replacing previous import 'AnnotationDbi::tail' by 'utils::tail' when
## loading 'ath1121501cdf'
## Warning: replacing previous import 'AnnotationDbi::head' by 'utils::head' when
## loading 'ath1121501cdf'
##
library(ath1121501.db)
## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: BiocGenerics
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:lubridate':
##
## intersect, setdiff, union
## The following objects are masked from 'package:dplyr':
##
## combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, aperm, append, as.data.frame, basename, cbind,
## colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
## get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
## match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
## Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
## table, tapply, union, unique, unsplit, which.max, which.min
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: IRanges
## Loading required package: S4Vectors
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:lubridate':
##
## second, second<-
## The following objects are masked from 'package:dplyr':
##
## first, rename
## The following object is masked from 'package:tidyr':
##
## expand
## The following object is masked from 'package:utils':
##
## findMatches
## The following objects are masked from 'package:base':
##
## expand.grid, I, unname
##
## Attaching package: 'IRanges'
## The following object is masked from 'package:lubridate':
##
## %within%
## The following objects are masked from 'package:dplyr':
##
## collapse, desc, slice
## The following object is masked from 'package:purrr':
##
## reduce
##
## Attaching package: 'AnnotationDbi'
## The following object is masked from 'package:dplyr':
##
## select
## Loading required package: org.At.tair.db
##
##
library(annotate)
## Loading required package: XML
library(affy)
##
## Attaching package: 'affy'
## The following object is masked from 'package:lubridate':
##
## pm
library(limma)
##
## Attaching package: 'limma'
## The following object is masked from 'package:BiocGenerics':
##
## plotMA
library(oligo)
## Loading required package: oligoClasses
## Welcome to oligoClasses version 1.64.0
##
## Attaching package: 'oligoClasses'
## The following object is masked from 'package:affy':
##
## list.celfiles
## Loading required package: Biostrings
## Loading required package: XVector
##
## Attaching package: 'XVector'
## The following object is masked from 'package:purrr':
##
## compact
## Loading required package: GenomeInfoDb
##
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
##
## strsplit
## ================================================================================
## Welcome to oligo version 1.66.0
## ================================================================================
##
## Attaching package: 'oligo'
## The following object is masked from 'package:limma':
##
## backgroundCorrect
## The following objects are masked from 'package:affy':
##
## intensity, MAplot, mm, mm<-, mmindex, pm, pm<-, pmindex,
## probeNames, rma
## The following object is masked from 'package:lubridate':
##
## pm
## The following object is masked from 'package:dplyr':
##
## summarize
library(dplyr)
library(reshape)
##
## Attaching package: 'reshape'
## The following objects are masked from 'package:S4Vectors':
##
## expand, rename
## The following object is masked from 'package:lubridate':
##
## stamp
## The following object is masked from 'package:dplyr':
##
## rename
## The following objects are masked from 'package:tidyr':
##
## expand, smiths
Read the cel files
targets <- readTargets("Bric16_Targets.csv", sep = ",", row.names = "filename")
ab <- ReadAffy()
hist(ab)
Normlized microarray experiments
eset <- affy::rma(ab)
## Background correcting
## Normalizing
## Calculating Expression
pData(eset)
## sample
## Atha_Ler-0_sShoots_FLT_Rep1_F-F2_raw.CEL 1
## Atha_Ler-0_sShoots_FLT_Rep2_F-F3_raw.CEL 2
## Atha_Ler-0_sShoots_FLT_Rep3_F-F4_raw.CEL 3
## Atha_Ler-0_sShoots_GC_Rep1_H-F2_raw.CEL 4
## Atha_Ler-0_sShoots_GC_Rep2_H-F3_raw.CEL 5
## Atha_Ler-0_sShoots_GC_Rep3_H-F4_raw.CEL 6
Normalized data visulized
hist(eset)
charaterize the data
ID <- featureNames(eset)
symbol <- getSYMBOL(ID, "ath1121501.db")
affydata <- read.delim("AFFY_ATH1_array_elements.txt")
## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,
## : EOF within quoted string
###Differntial Gene Expression Analyssis
Fligth vs Ground
condition <- targets$gravity
design <- model.matrix(~factor(condition))
colnames(design) <- c("flight", "ground")
fit <- lmFit(eset, design)
fit <- eBayes(fit)
options(digits = 2)
top <- topTable(fit, coef =2, n=Inf, adjust='fdr')
Lets combine annotations
Annot <- data.frame(Gene = sapply(contents(ath1121501GENENAME), paste, collapse = ",'"),
KEGG = sapply(contents(ath1121501PATH), paste, collapse = ","),
GO = sapply(contents(ath1121501GO), paste, collapse = ","),
SYMBOL = sapply(contents(ath1121501SYMBOL), paste, collapse = ","),
ACCNUM = sapply(contents(ath1121501ACCNUM), paste, collapse = ","))
## Warning: The contents() method for Bimap objects is deprecated. Please use
## as.list() instead.
## Warning: The contents() method for Bimap objects is deprecated. Please use
## as.list() instead.
## Warning: The contents() method for Bimap objects is deprecated. Please use
## as.list() instead.
## Warning: The contents() method for Bimap objects is deprecated. Please use
## as.list() instead.
## Warning: The contents() method for Bimap objects is deprecated. Please use
## as.list() instead.
Merge all data into one data frame
all <- merge(Annot, top, by.x =0, by.y =0, all =TRUE)
all2 <- merge(all, affydata, by.x = "Row.names", by.y = "array_element_name")
Convert ACCNUM to character value
all$ACCNUM <- as.character(all$ACCNUM)
write.table(all2, file="BRIC_16_Final.csv", row.names = TRUE, col.names = TRUE, sep = "\t")
columns(org.At.tair.db)
## [1] "ARACYC" "ARACYCENZYME" "ENTREZID" "ENZYME" "EVIDENCE"
## [6] "EVIDENCEALL" "GENENAME" "GO" "GOALL" "ONTOLOGY"
## [11] "ONTOLOGYALL" "PATH" "PMID" "REFSEQ" "SYMBOL"
## [16] "TAIR"
foldchanges <- as.data.frame(all$logFC)
all2$entrez = mapIds(org.At.tair.db,
keys = all2$ACCNUM,
column = "ENTREZID",
keytype = "TAIR",
multivals = "first")
## 'select()' returned 1:1 mapping between keys and columns
head(all2, 10)
## Row.names Gene KEGG
## 1 244903_at hypothetical protein NA
## 2 244904_at hypothetical protein NA
## 3 244906_at hypothetical protein NA
## 4 244907_at hypothetical protein NA
## 5 244908_at hypothetical protein NA
## 6 244911_at hypothetical protein NA
## 7 244913_at hypothetical protein NA
## 8 244914_at hypothetical protein NA
## 9 244916_at hypothetical protein NA
## 10 244917_at hypothetical protein NA
## GO
## 1 list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"),list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"),list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 2 list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"),list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 3 list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"),list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"),list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 4 list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"),list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"),list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 5 list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"),list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"),list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 6 list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"),list(GOID = "GO:0005575", Evidence = "ND", Ontology = "CC"),list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 7 list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"),list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"),list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 8 list(GOID = "GO:0018130", Evidence = "IEA", Ontology = "BP"),list(GOID = "GO:0018130", Evidence = "IEA", Ontology = "BP"),list(GOID = "GO:0018130", Evidence = "IEA", Ontology = "BP"),list(GOID = "GO:0019438", Evidence = "IEA", Ontology = "BP"),list(GOID = "GO:0019438", Evidence = "IEA", Ontology = "BP"),list(GOID = "GO:0019438", Evidence = "IEA", Ontology = "BP"),list(GOID = "GO:0044271", Evidence = "IEA", Ontology = "BP"),list(GOID = "GO:0044271", Evidence = "IEA", Ontology = "BP"),list(GOID = "GO:0044271", Evidence = "IEA", Ontology = "BP"),list(GOID = "GO:1901362", Evidence = "IEA", Ontology = "BP"),list(GOID = "GO:1901362", Evidence = "IEA", Ontology = "BP"),list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"),list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 9 list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"),list(GOID = "GO:0005634", Evidence = "ISM", Ontology = "CC"),list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 10 list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"),list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"),list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## SYMBOL ACCNUM logFC AveExpr t P.Value adj.P.Val B
## 1 ORF149 ATMG00660 -1.214 7.8 -6.20 0.0004 0.018 0.5
## 2 ORF275 ATMG00670 -0.568 3.2 -2.07 0.0758 0.300 -4.9
## 3 ORF240A ATMG00690 0.055 9.1 0.18 0.8586 0.947 -6.8
## 4 ORF120 ATMG00710 -0.883 4.3 -2.91 0.0220 0.154 -3.7
## 5 ORF107D ATMG00720 -0.463 2.2 -1.73 0.1262 0.396 -5.4
## 6 ORF170 ATMG00820 -0.193 2.7 -0.72 0.4948 0.749 -6.5
## 7 ORF121B ATMG00840 -0.339 1.8 -1.18 0.2764 0.574 -6.1
## 8 ORF107E ATMG00850 -0.305 5.5 -1.30 0.2338 0.534 -6.0
## 9 ORF187 ATMG00880 -0.888 2.8 -2.32 0.0527 0.245 -4.6
## 10 ORF184 ATMG00870 -0.379 3.6 -1.31 0.2307 0.531 -5.9
## array_element_type organism is_control locus
## 1 oligonucleotide Arabidopsis thaliana no ATMG00660
## 2 oligonucleotide Arabidopsis thaliana no ATMG00670
## 3 oligonucleotide Arabidopsis thaliana no ATMG00690
## 4 oligonucleotide Arabidopsis thaliana no ATMG00710
## 5 oligonucleotide Arabidopsis thaliana no ATMG00720
## 6 oligonucleotide Arabidopsis thaliana no ATMG00820
## 7 oligonucleotide Arabidopsis thaliana no AT2G07626
## 8 oligonucleotide Arabidopsis thaliana no ATMG00850
## 9 oligonucleotide Arabidopsis thaliana no ATMG00880
## 10 oligonucleotide Arabidopsis thaliana no ATMG00870
## description
## 1 hypothetical protein;(source:Araport11)
## 2 transmembrane protein;(source:Araport11)
## 3 FO-ATPase subunit;(source:Araport11)
## 4 Polynucleotidyl transferase, ribonuclease H-like superfamily protein;(source:Araport11)
## 5 hypothetical protein;(source:Araport11)
## 6 Reverse transcriptase (RNA-dependent DNA polymerase);(source:Araport11)
## 7 hypothetical protein;(source:Araport11)
## 8 DNA/RNA polymerases superfamily protein;(source:Araport11)
## 9 hypothetical protein;(source:Araport11)
## 10 hypothetical protein;(source:Araport11)
## chromosome entrez
## 1 M <NA>
## 2 M <NA>
## 3 M <NA>
## 4 M <NA>
## 5 M <NA>
## 6 M <NA>
## 7 2 <NA>
## 8 M <NA>
## 9 M <NA>
## 10 M <NA>
##Pathview
library(pathview)
##
## ##############################################################################
## Pathview is an open source software package distributed under GNU General
## Public License version 3 (GPLv3). Details of GPLv3 is available at
## http://www.gnu.org/licenses/gpl-3.0.html. Particullary, users are required to
## formally cite the original Pathview paper (not just mention it) in publications
## or products. For details, do citation("pathview") within R.
##
## The pathview downloads and uses KEGG data. Non-academic uses may require a KEGG
## license agreement (details at http://www.kegg.jp/kegg/legal.html).
## ##############################################################################
library(gage)
##
library(gageData)
data("kegg.sets.hs")
data(kegg.sets.hs)
foldchanges = all2$logFC
names(foldchanges) = all2$entrez
head(foldchanges)
## <NA> <NA> <NA> <NA> <NA> <NA>
## -1.214 -0.568 0.055 -0.883 -0.463 -0.193
kegg.ath = kegg.gsets("ath", id.type = "entrez")
kegg.ath.sigmet = kegg.ath$kg.sets[kegg.ath$sigmet.idx]
Lets get the results
keggres = gage(foldchanges, gsets=kegg.ath.sigmet, same.dir =TRUE)
lapply(keggres, head)
## $greater
## p.geomean stat.mean p.val
## ath03010 Ribosome 1.5e-14 8.2 1.5e-14
## ath01230 Biosynthesis of amino acids 3.1e-04 3.5 3.1e-04
## ath00040 Pentose and glucuronate interconversions 2.0e-03 3.0 2.0e-03
## ath00195 Photosynthesis 2.6e-03 3.0 2.6e-03
## ath00966 Glucosinolate biosynthesis 7.0e-03 2.7 7.0e-03
## ath01232 Nucleotide metabolism 9.2e-03 2.4 9.2e-03
## q.val set.size exp1
## ath03010 Ribosome 1.9e-12 129 1.5e-14
## ath01230 Biosynthesis of amino acids 1.9e-02 85 3.1e-04
## ath00040 Pentose and glucuronate interconversions 8.0e-02 49 2.0e-03
## ath00195 Photosynthesis 8.0e-02 19 2.6e-03
## ath00966 Glucosinolate biosynthesis 1.6e-01 12 7.0e-03
## ath01232 Nucleotide metabolism 1.6e-01 42 9.2e-03
##
## $less
## p.geomean stat.mean p.val q.val
## ath04120 Ubiquitin mediated proteolysis 0.043 -1.7 0.043 1
## ath04016 MAPK signaling pathway - plant 0.043 -1.7 0.043 1
## ath00592 alpha-Linolenic acid metabolism 0.045 -1.7 0.045 1
## ath03040 Spliceosome 0.075 -1.4 0.075 1
## ath00350 Tyrosine metabolism 0.093 -1.4 0.093 1
## ath00906 Carotenoid biosynthesis 0.116 -1.2 0.116 1
## set.size exp1
## ath04120 Ubiquitin mediated proteolysis 63 0.043
## ath04016 MAPK signaling pathway - plant 74 0.043
## ath00592 alpha-Linolenic acid metabolism 19 0.045
## ath03040 Spliceosome 77 0.075
## ath00350 Tyrosine metabolism 19 0.093
## ath00906 Carotenoid biosynthesis 18 0.116
##
## $stats
## stat.mean exp1
## ath03010 Ribosome 8.2 8.2
## ath01230 Biosynthesis of amino acids 3.5 3.5
## ath00040 Pentose and glucuronate interconversions 3.0 3.0
## ath00195 Photosynthesis 3.0 3.0
## ath00966 Glucosinolate biosynthesis 2.7 2.7
## ath01232 Nucleotide metabolism 2.4 2.4
greater <- keggres$greater
lessers <- keggres$less
write.csv(greater, file = "BRIC_16_pathview_Greater_Pathways.csv")
write.csv(greater, file = "BRIC_16_pathview_Lesser_Pathways.csv")
Map to pathways (greater)
keggrespathways = data.frame(id=rownames(keggres$greater), keggres$greater) %>%
tbl_df() %>%
filter(row_number() <=5) %>%
.$id %>%
as.character()
## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## ℹ Please use `tibble::as_tibble()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
keggrespathways
## [1] "ath03010 Ribosome"
## [2] "ath01230 Biosynthesis of amino acids"
## [3] "ath00040 Pentose and glucuronate interconversions"
## [4] "ath00195 Photosynthesis"
## [5] "ath00966 Glucosinolate biosynthesis"
keggresids_greater = substr(keggrespathways, start=1, stop=8)
keggresids_greater
## [1] "ath03010" "ath01230" "ath00040" "ath00195" "ath00966"
genedata <- as.character(all2$logFC)
Define a plotting function
plot_pathway = function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "ath", new.signature =FALSE)
#Plot multiple pathways
tmp = sapply(keggresids_greater, function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "ath", kegg.native = FALSE))
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Note: ath03010 not rendered, 0 or 1 connected nodes!
## Try "kegg.native=T" instead!
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file ath01230.pathview.pdf
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Warning in .local(from, to, graph): edges replaced: '217|104', '217|216'
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file ath00040.pathview.pdf
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Note: ath00195 not rendered, 0 or 1 connected nodes!
## Try "kegg.native=T" instead!
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file ath00966.pathview.pdf
map pathways to lessers
keggrespathways = data.frame(id=rownames(keggres$less), keggres$less) %>%
tbl_df() %>%
filter(row_number() <=5) %>%
.$id %>%
as.character()
## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## ℹ Please use `tibble::as_tibble()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
keggrespathways
## [1] "ath04120 Ubiquitin mediated proteolysis"
## [2] "ath04016 MAPK signaling pathway - plant"
## [3] "ath00592 alpha-Linolenic acid metabolism"
## [4] "ath03040 Spliceosome"
## [5] "ath00350 Tyrosine metabolism"
keggresids_less = substr(keggrespathways, start=1, stop=8)
keggresids_less
## [1] "ath04120" "ath04016" "ath00592" "ath03040" "ath00350"
genedata <- as.character(all2$logFC)
Define a plotting function
plot_pathway = function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "ath", new.signature =FALSE)
Plot multiple pathways
tmp = sapply(keggresids_less, function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "ath", kegg.native = FALSE))
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Note: ath04120 not rendered, 0 or 1 connected nodes!
## Try "kegg.native=T" instead!
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Warning: reconcile groups sharing member nodes!
## [,1] [,2]
## [1,] "7" "228"
## [2,] "8" "228"
## [3,] "7" "229"
## [4,] "8" "229"
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file ath04016.pathview.pdf
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file ath00592.pathview.pdf
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Note: ath03040 not rendered, 0 or 1 connected nodes!
## Try "kegg.native=T" instead!
## Info: Getting gene ID data from KEGG...
## Info: Done with data retrieval!
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file ath00350.pathview.pdf
load Libraries
library("edgeR")
library("dplyr")
library("AnnotationDbi")
library("org.Mm.eg.db")
##
rawdata = read.csv("GLDS-102_rna_seq_Normalized_Counts.csv", row.names = "gene_ID")
group <- factor(c('1', '1', '1', '1', '1', '1', '2', '2', '2', '2', '2', '2'))
dgeGlm <- DGEList(counts = rawdata, group = group)
str(dgeGlm)
## Formal class 'DGEList' [package "edgeR"] with 1 slot
## ..@ .Data:List of 2
## .. ..$ : num [1:24035, 1:12] 2976.8 59.8 21.2 40.1 0 ...
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:24035] "ENSMUSG00000000001" "ENSMUSG00000000028" "ENSMUSG00000000031" "ENSMUSG00000000037" ...
## .. .. .. ..$ : chr [1:12] "Mmus_C57_6J_KDN_FLT_Rep1_M23" "Mmus_C57_6J_KDN_FLT_Rep2_M24" "Mmus_C57_6J_KDN_FLT_Rep3_M25" "Mmus_C57_6J_KDN_FLT_Rep4_M26" ...
## .. ..$ :'data.frame': 12 obs. of 3 variables:
## .. .. ..$ group : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 2 2 2 2 ...
## .. .. ..$ lib.size : num [1:12] 40266365 40740336 37739541 39196969 36820645 ...
## .. .. ..$ norm.factors: num [1:12] 1 1 1 1 1 1 1 1 1 1 ...
## ..$ names: chr [1:2] "counts" "samples"
keep <- rowSums(cpm(dgeGlm)>2) >=4
dgeGlm <- dgeGlm[keep,]
names(dgeGlm)
## [1] "counts" "samples"
dgeGlm[["samples"]]
## group lib.size norm.factors
## Mmus_C57_6J_KDN_FLT_Rep1_M23 1 4.0e+07 1
## Mmus_C57_6J_KDN_FLT_Rep2_M24 1 4.1e+07 1
## Mmus_C57_6J_KDN_FLT_Rep3_M25 1 3.8e+07 1
## Mmus_C57_6J_KDN_FLT_Rep4_M26 1 3.9e+07 1
## Mmus_C57_6J_KDN_FLT_Rep5_M27 1 3.7e+07 1
## Mmus_C57_6J_KDN_FLT_Rep6_M28 1 3.6e+07 1
## Mmus_C57_6J_KDN_GC_Rep1_M33 2 3.7e+07 1
## Mmus_C57_6J_KDN_GC_Rep2_M34 2 3.7e+07 1
## Mmus_C57_6J_KDN_GC_Rep3_M35 2 4.0e+07 1
## Mmus_C57_6J_KDN_GC_Rep4_M36 2 3.6e+07 1
## Mmus_C57_6J_KDN_GC_Rep5_M37 2 3.8e+07 1
## Mmus_C57_6J_KDN_GC_Rep6_M38 2 3.5e+07 1
design <- model.matrix(~group)
design
## (Intercept) group2
## 1 1 0
## 2 1 0
## 3 1 0
## 4 1 0
## 5 1 0
## 6 1 0
## 7 1 1
## 8 1 1
## 9 1 1
## 10 1 1
## 11 1 1
## 12 1 1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
dgeGlmComDisp <- estimateGLMCommonDisp(dgeGlm, design, verbose = TRUE)
## Disp = 0.026 , BCV = 0.16
dgeGlmTrendDrisp <- estimateGLMTrendedDisp(dgeGlmComDisp, design)
dgeGlmTagDisp <- estimateGLMTagwiseDisp(dgeGlmTrendDrisp, design)
plotBCV(dgeGlmTagDisp)
fit <- glmFit(dgeGlmTagDisp, design)
colnames(coef(fit))
## [1] "(Intercept)" "group2"
lrt <- glmLRT(fit, coef = 2)
ttGlm <- topTags(lrt, n = Inf)
class(ttGlm)
## [1] "TopTags"
## attr(,"package")
## [1] "edgeR"
summary(deGlm <- decideTestsDGE(lrt, p = 0.1, adjust = "fdr"))
## group2
## Down 64
## NotSig 13390
## Up 159
tagsGlm <- rownames(dgeGlmTagDisp)[as.logical(deGlm)]
plotSmear(lrt, de.tags = tagsGlm)
hits2 <- ttGlm$table[ttGlm$table$FDR < 0.1, ]
write.csv(hits2, "Mouse_EdgeR_Results_Zero_vs_1")
columns(org.Mm.eg.db)
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
## [6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
## [11] "GENETYPE" "GO" "GOALL" "IPI" "MGI"
## [16] "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM" "PMID"
## [21] "PROSITE" "REFSEQ" "SYMBOL" "UNIPROT"
ttGlm2 <- as.data.frame(ttGlm$table)
ttGlm2$symbol = mapIds(org.Mm.eg.db,
keys=row.names(ttGlm2),
column = "SYMBOL",
keytype = "ENSEMBL",
multiVals = "first")
## 'select()' returned 1:many mapping between keys and columns
ttGlm2$entrez = mapIds(org.Mm.eg.db,
keys=row.names(ttGlm2),
column = "ENTREZID",
keytype = "ENSEMBL",
multiVals = "first")
## 'select()' returned 1:many mapping between keys and columns
ttGlm2$name = mapIds(org.Mm.eg.db,
keys=row.names(ttGlm2),
column = "GENENAME",
keytype = "ENSEMBL",
multiVals = "first")
## 'select()' returned 1:many mapping between keys and columns
head(ttGlm2)
## logFC logCPM LR PValue FDR symbol entrez
## ENSMUSG00000026077 -1.36 3.6 80 4.3e-19 5.9e-15 Npas2 18143
## ENSMUSG00000007872 0.89 5.5 77 1.9e-18 1.3e-14 Id3 15903
## ENSMUSG00000021775 0.95 6.2 63 2.0e-15 9.1e-12 Nr1d2 353187
## ENSMUSG00000002250 -0.83 5.3 62 2.7e-15 9.2e-12 Ppard 19015
## ENSMUSG00000059824 2.26 4.6 58 2.6e-14 7.2e-11 Dbp 13170
## ENSMUSG00000074715 -1.99 3.8 47 7.0e-12 1.6e-08 Ccl28 56838
## name
## ENSMUSG00000026077 neuronal PAS domain protein 2
## ENSMUSG00000007872 inhibitor of DNA binding 3
## ENSMUSG00000021775 nuclear receptor subfamily 1, group D, member 2
## ENSMUSG00000002250 peroxisome proliferator activator receptor delta
## ENSMUSG00000059824 D site albumin promoter binding protein
## ENSMUSG00000074715 C-C motif chemokine ligand 28
library(pathview)
library(gage)
library(gageData)
data(kegg.sets.mm)
data(go.subs.mm)
foldchanges <- ttGlm2$logFC
names(foldchanges) <- ttGlm2$entrez
kegg.mm = kegg.gsets("mouse", id.type = "entrez")
kegg.mm.sigmet = kegg.mm$kg.sets[kegg.mm$sigmet.idx]
# Lets get the results
keggres = gage(foldchanges, gsets=kegg.mm.sigmet, same.dir = TRUE)
lapply(keggres, head)
## $greater
## p.geomean
## mmu03010 Ribosome 9.5e-05
## mmu04550 Signaling pathways regulating pluripotency of stem cells 2.0e-03
## mmu04330 Notch signaling pathway 6.1e-03
## mmu04350 TGF-beta signaling pathway 1.3e-02
## mmu04390 Hippo signaling pathway 2.0e-02
## mmu00830 Retinol metabolism 2.1e-02
## stat.mean
## mmu03010 Ribosome 3.8
## mmu04550 Signaling pathways regulating pluripotency of stem cells 2.9
## mmu04330 Notch signaling pathway 2.6
## mmu04350 TGF-beta signaling pathway 2.2
## mmu04390 Hippo signaling pathway 2.1
## mmu00830 Retinol metabolism 2.1
## p.val q.val
## mmu03010 Ribosome 9.5e-05 0.023
## mmu04550 Signaling pathways regulating pluripotency of stem cells 2.0e-03 0.235
## mmu04330 Notch signaling pathway 6.1e-03 0.488
## mmu04350 TGF-beta signaling pathway 1.3e-02 0.783
## mmu04390 Hippo signaling pathway 2.0e-02 0.826
## mmu00830 Retinol metabolism 2.1e-02 0.826
## set.size
## mmu03010 Ribosome 127
## mmu04550 Signaling pathways regulating pluripotency of stem cells 100
## mmu04330 Notch signaling pathway 54
## mmu04350 TGF-beta signaling pathway 84
## mmu04390 Hippo signaling pathway 125
## mmu00830 Retinol metabolism 37
## exp1
## mmu03010 Ribosome 9.5e-05
## mmu04550 Signaling pathways regulating pluripotency of stem cells 2.0e-03
## mmu04330 Notch signaling pathway 6.1e-03
## mmu04350 TGF-beta signaling pathway 1.3e-02
## mmu04390 Hippo signaling pathway 2.0e-02
## mmu00830 Retinol metabolism 2.1e-02
##
## $less
## p.geomean stat.mean p.val
## mmu04613 Neutrophil extracellular trap formation 0.00012 -3.7 0.00012
## mmu04145 Phagosome 0.00192 -2.9 0.00192
## mmu04110 Cell cycle 0.00276 -2.8 0.00276
## mmu04714 Thermogenesis 0.00472 -2.6 0.00472
## mmu04217 Necroptosis 0.00614 -2.5 0.00614
## mmu00910 Nitrogen metabolism 0.00867 -2.6 0.00867
## q.val set.size exp1
## mmu04613 Neutrophil extracellular trap formation 0.029 137 0.00012
## mmu04145 Phagosome 0.221 121 0.00192
## mmu04110 Cell cycle 0.221 134 0.00276
## mmu04714 Thermogenesis 0.283 208 0.00472
## mmu04217 Necroptosis 0.295 113 0.00614
## mmu00910 Nitrogen metabolism 0.347 13 0.00867
##
## $stats
## stat.mean
## mmu03010 Ribosome 3.8
## mmu04550 Signaling pathways regulating pluripotency of stem cells 2.9
## mmu04330 Notch signaling pathway 2.6
## mmu04350 TGF-beta signaling pathway 2.2
## mmu04390 Hippo signaling pathway 2.1
## mmu00830 Retinol metabolism 2.1
## exp1
## mmu03010 Ribosome 3.8
## mmu04550 Signaling pathways regulating pluripotency of stem cells 2.9
## mmu04330 Notch signaling pathway 2.6
## mmu04350 TGF-beta signaling pathway 2.2
## mmu04390 Hippo signaling pathway 2.1
## mmu00830 Retinol metabolism 2.1
greaters <- keggres$greater
lessers <- keggres$less
keggrespathways = data.frame(id = rownames(keggres$greater), keggres$greater) %>%
tbl_df() %>%
filter(row_number()<=5) %>%
.$id %>%
as.character()
## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## ℹ Please use `tibble::as_tibble()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
keggrespathways
## [1] "mmu03010 Ribosome"
## [2] "mmu04550 Signaling pathways regulating pluripotency of stem cells"
## [3] "mmu04330 Notch signaling pathway"
## [4] "mmu04350 TGF-beta signaling pathway"
## [5] "mmu04390 Hippo signaling pathway"
keggresids = substr(keggrespathways, start=1, stop = 8)
keggresids
## [1] "mmu03010" "mmu04550" "mmu04330" "mmu04350" "mmu04390"
plot_pathways = function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "mouse", new.signature = FALSE)
tmp = sapply(keggresids, function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "mouse", kegg.native = FALSE))
## 'select()' returned 1:1 mapping between keys and columns
## Note: mmu03010 not rendered, 0 or 1 connected nodes!
## Try "kegg.native=T" instead!
## 'select()' returned 1:1 mapping between keys and columns
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file mmu04550.pathview.pdf
## 'select()' returned 1:1 mapping between keys and columns
## Warning: reconcile groups sharing member nodes!
## [,1] [,2]
## [1,] "13" "74"
## [2,] "13" "75"
## [3,] "13" "76"
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file mmu04330.pathview.pdf
## 'select()' returned 1:1 mapping between keys and columns
## Warning: reconcile groups sharing member nodes!
## [,1] [,2]
## [1,] "213" "234"
## [2,] "214" "234"
## [3,] "215" "234"
## [4,] "216" "234"
## [5,] "213" "235"
## [6,] "214" "235"
## [7,] "215" "235"
## [8,] "216" "235"
## Warning: Nodes are not the same type, hence unable to combine!
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file mmu04350.pathview.pdf
## 'select()' returned 1:1 mapping between keys and columns
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file mmu04390.pathview.pdf
keggrespathways = data.frame(id = rownames(keggres$less), keggres$less) %>%
tbl_df() %>%
filter(row_number()<=5) %>%
.$id %>%
as.character()
## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## ℹ Please use `tibble::as_tibble()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
keggrespathways
## [1] "mmu04613 Neutrophil extracellular trap formation"
## [2] "mmu04145 Phagosome"
## [3] "mmu04110 Cell cycle"
## [4] "mmu04714 Thermogenesis"
## [5] "mmu04217 Necroptosis"
keggresids = substr(keggrespathways, start=1, stop = 8)
keggresids
## [1] "mmu04613" "mmu04145" "mmu04110" "mmu04714" "mmu04217"
plot_pathways = function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "mouse", new.signature = FALSE)
tmp = sapply(keggresids, function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "mouse", kegg.native = FALSE))
## 'select()' returned 1:1 mapping between keys and columns
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file mmu04613.pathview.pdf
## 'select()' returned 1:1 mapping between keys and columns
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file mmu04145.pathview.pdf
## 'select()' returned 1:1 mapping between keys and columns
## Warning: reconcile groups sharing member nodes!
## [,1] [,2]
## [1,] "9" "300"
## [2,] "9" "306"
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file mmu04110.pathview.pdf
## 'select()' returned 1:1 mapping between keys and columns
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file mmu04714.pathview.pdf
## 'select()' returned 1:1 mapping between keys and columns
## Warning: reconcile groups sharing member nodes!
## [,1] [,2]
## [1,] "226" "232"
## [2,] "228" "232"
## [3,] "226" "234"
## [4,] "228" "234"
## Info: Working in directory /home/student/Desktop/classroom/myfiles
## Info: Writing image file mmu04217.pathview.pdf
library(imager)
## Loading required package: magrittr
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
##
## Attaching package: 'imager'
## The following object is masked from 'package:magrittr':
##
## add
## The following object is masked from 'package:Biostrings':
##
## width
## The following object is masked from 'package:XVector':
##
## width
## The following objects are masked from 'package:oligoClasses':
##
## B, B<-
## The following objects are masked from 'package:IRanges':
##
## resize, width
## The following object is masked from 'package:S4Vectors':
##
## width
## The following object is masked from 'package:Biobase':
##
## channel
## The following object is masked from 'package:BiocGenerics':
##
## width
## The following object is masked from 'package:stringr':
##
## boundary
## The following object is masked from 'package:dplyr':
##
## where
## The following object is masked from 'package:tidyr':
##
## fill
## The following objects are masked from 'package:stats':
##
## convolve, spectrum
## The following object is masked from 'package:graphics':
##
## frame
## The following object is masked from 'package:base':
##
## save.image
#This line doesnt work for some reason
filenames <- list.files(path = "~/Desktop/classroom/myfiles/meganbarr789_gmail_com", pattern = ".*pathview.png")
all_iages <- lapply(filenames, load.image)
knitr::include_graphics(filenames)
Load the required libraries
library("DESeq2")
## Loading required package: GenomicRanges
##
## Attaching package: 'GenomicRanges'
## The following object is masked from 'package:magrittr':
##
## subtract
## Loading required package: SummarizedExperiment
## Loading required package: MatrixGenerics
## Loading required package: matrixStats
##
## Attaching package: 'matrixStats'
## The following objects are masked from 'package:Biobase':
##
## anyMissing, rowMedians
## The following object is masked from 'package:dplyr':
##
## count
##
## Attaching package: 'MatrixGenerics'
## The following objects are masked from 'package:matrixStats':
##
## colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
## colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
## colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
## colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
## colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
## colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
## colWeightedMeans, colWeightedMedians, colWeightedSds,
## colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
## rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
## rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
## rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
## rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
## rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
## rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
## rowWeightedSds, rowWeightedVars
## The following object is masked from 'package:Biobase':
##
## rowMedians
library("dplyr")
library("apeglm")
load in data
countData <- read.csv("GLDS-102_rna_seq_Unnormalized_Counts.csv")
colData <- read.csv("PHENO_DATA_mouse.csv", sep= ",", row.names = 1)
add leveling factors to colData
colData$group <- factor(colData$group, levels = c("0", "1"))
make sure there are the same number of individuals as well as groups
all(rownames(colData))%in% colnames(countData)
## Warning in all(rownames(colData)): coercing argument of type 'character' to
## logical
## [1] FALSE
round up the counts
countData %>%
mutate_if(is.numeric, ceiling)
countData[, 2:13] <- sapply(countData[,2:13], as.integer)
row.names(countData) <- countData[,1]
countData <- countData[2:13]
rownames(colData) == colnames(countData)
dds <- DESeqDataSetFromMatrix(countData = countData, colData = colData, design = ~group)
dds <- dds[rowSums(counts(dds)>2) >=4]
dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
res <- results(dds)
resLFC <- lfcShrink(dds, coef = 2)
## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
## Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
## sequence count data: removing the noise and preserving large differences.
## Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
(resOrdered <- res[order(res$padj), ])
## log2 fold change (MLE): group 1 vs 0
## Wald test p-value: group 1 vs 0
## DataFrame with 22008 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSMUSG00000002250 1459.223 -0.926292 0.1112628 -8.32526 8.41430e-17
## ENSMUSG00000007872 1719.375 0.818829 0.1122820 7.29261 3.04014e-13
## ENSMUSG00000026077 437.035 -1.191812 0.1655873 -7.19748 6.13338e-13
## ENSMUSG00000040998 14579.593 -0.506307 0.0703771 -7.19421 6.28252e-13
## ENSMUSG00000021775 2804.923 0.842511 0.1233312 6.83129 8.41546e-12
## ... ... ... ... ... ...
## ENSMUSG00000118345 4.22314 -0.12097478 0.599072 -0.20193699 0.839966
## ENSMUSG00000118353 6.60578 0.56456713 0.481195 1.17326031 0.240691
## ENSMUSG00000118358 3.30902 -0.00273584 0.763559 -0.00358301 0.997141
## ENSMUSG00000118369 2.91657 -1.11623145 0.790702 -1.41169702 0.158039
## ENSMUSG00000118384 7.43136 0.23830798 0.489273 0.48706567 0.626212
## padj
## <numeric>
## ENSMUSG00000002250 1.24077e-12
## ENSMUSG00000007872 2.24149e-09
## ENSMUSG00000026077 2.31605e-09
## ENSMUSG00000040998 2.31605e-09
## ENSMUSG00000021775 2.48189e-08
## ... ...
## ENSMUSG00000118345 NA
## ENSMUSG00000118353 NA
## ENSMUSG00000118358 NA
## ENSMUSG00000118369 NA
## ENSMUSG00000118384 NA
summary(res)
##
## out of 22008 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 325, 1.5%
## LFC < 0 (down) : 327, 1.5%
## outliers [1] : 15, 0.068%
## low counts [2] : 7247, 33%
## (mean count < 38)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
FLT_Vs_GC <- as.data.frame(res$log2FoldChange)
head(FLT_Vs_GC)
## res$log2FoldChange
## 1 0.0421
## 2 -0.1334
## 3 -0.0185
## 4 -0.0882
## 5 -0.0079
## 6 0.1136
plotMA(resLFC, ylim = c(-2,2))
pdf(file = "MA_Plot_FLT_vs_GC.pdf")
dev.off()
## png
## 2
save differential expression results to file.
write.csv(as.data.frame(resOrdered), file = "Mousse_DESeq.csv")
perform a custom transformation
dds <- estimateSizeFactors(dds)
se <- SummarizedExperiment(log2(counts(dds, normalize = TRUE)+1), colData = colData(dds))
pdf(file = "PCA_PLOT_FLT_vs_GC.pdf")
plotPCA(DESeqTransform(se), intgroup = "group")
## using ntop=500 top features by variance
load packages
library(AnnotationDbi)
library(org.Mm.eg.db)
columns(org.Mm.eg.db)
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
## [6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
## [11] "GENETYPE" "GO" "GOALL" "IPI" "MGI"
## [16] "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM" "PMID"
## [21] "PROSITE" "REFSEQ" "SYMBOL" "UNIPROT"
foldchanges <- as.data.frame(res$log2FoldChange, row.names = row.names(res))
head(foldchanges)
## res$log2FoldChange
## ENSMUSG00000000001 0.0421
## ENSMUSG00000000028 -0.1334
## ENSMUSG00000000031 -0.0185
## ENSMUSG00000000037 -0.0882
## ENSMUSG00000000049 -0.0079
## ENSMUSG00000000056 0.1136
res$symbol = mapIds(org.Mm.eg.db,
keys = row.names(res),
column = "SYMBOL",
keytype = "ENSEMBL",
multiVals = "first")
## 'select()' returned 1:many mapping between keys and columns
res$entrez = mapIds(org.Mm.eg.db,
keys = row.names(res),
column = "ENTREZID",
keytype = "ENSEMBL",
multiVals = "first")
## 'select()' returned 1:many mapping between keys and columns
res$name = mapIds(org.Mm.eg.db,
keys = row.names(res),
column = "GENENAME",
keytype = "ENSEMBL",
multiVals = "first")
## 'select()' returned 1:many mapping between keys and columns
head(res, 10)
## log2 fold change (MLE): group 1 vs 0
## Wald test p-value: group 1 vs 0
## DataFrame with 10 rows and 9 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSMUSG00000000001 3132.35128 0.04214340 0.0436714 0.9650117 0.334539
## ENSMUSG00000000028 68.75801 -0.13342706 0.1565936 -0.8520597 0.394181
## ENSMUSG00000000031 21.05397 -0.01853142 0.2486477 -0.0745288 0.940590
## ENSMUSG00000000037 24.42314 -0.08817270 0.2982220 -0.2956613 0.767489
## ENSMUSG00000000049 3.24919 -0.00790342 0.9613572 -0.0082211 0.993441
## ENSMUSG00000000056 1424.88216 0.11355979 0.0777635 1.4603234 0.144201
## ENSMUSG00000000058 1420.78992 -0.03893850 0.0850602 -0.4577759 0.647113
## ENSMUSG00000000078 2254.53129 -0.07540275 0.0874314 -0.8624214 0.388456
## ENSMUSG00000000085 822.68179 0.04586772 0.0584699 0.7844667 0.432766
## ENSMUSG00000000088 5946.81754 -0.05461549 0.0691178 -0.7901795 0.429423
## padj symbol entrez name
## <numeric> <character> <character> <character>
## ENSMUSG00000000001 0.739637 Gnai3 14679 G protein subunit al..
## ENSMUSG00000000028 0.777085 Cdc45 12544 cell division cycle 45
## ENSMUSG00000000031 NA H19 14955 H19, imprinted mater..
## ENSMUSG00000000037 NA Scml2 107815 Scm polycomb group p..
## ENSMUSG00000000049 NA Apoh 11818 apolipoprotein H
## ENSMUSG00000000056 0.547527 Narf 67608 nuclear prelamin A r..
## ENSMUSG00000000058 0.901904 Cav2 12390 caveolin 2
## ENSMUSG00000000078 0.774294 Klf6 23849 Kruppel-like transcr..
## ENSMUSG00000000085 0.800018 Scmh1 29871 sex comb on midleg h..
## ENSMUSG00000000088 0.798420 Cox5a 12858 cytochrome c oxidase..
pathview packages
library(pathview)
library(gage)
library(gageData)
data("kegg.sets.mm")
data("go.subs.mm")
foldchanges <- res$log2FoldChange
names(foldchanges) = res$entrez
head(foldchanges)
## 14679 12544 14955 107815 11818 67608
## 0.0421 -0.1334 -0.0185 -0.0882 -0.0079 0.1136
kegg.mm <- kegg.gsets("mouse", id.type = "entrez")
kegg.mm.sigmet <- kegg.mm$kg.sets[kegg.mm$sigmet.idx]
map the results
keggres <- gage(foldchanges, gsets = kegg.mm.sigmet, same.dir = TRUE)
lapply(keggres, head)
## $greater
## p.geomean stat.mean p.val q.val
## mmu03010 Ribosome 0.017 2.1 0.017 0.89
## mmu04022 cGMP-PKG signaling pathway 0.030 1.9 0.030 0.89
## mmu04360 Axon guidance 0.038 1.8 0.038 0.89
## mmu04330 Notch signaling pathway 0.042 1.8 0.042 0.89
## mmu04658 Th1 and Th2 cell differentiation 0.054 1.6 0.054 0.89
## mmu02010 ABC transporters 0.075 1.5 0.075 0.89
## set.size exp1
## mmu03010 Ribosome 139 0.017
## mmu04022 cGMP-PKG signaling pathway 152 0.030
## mmu04360 Axon guidance 176 0.038
## mmu04330 Notch signaling pathway 58 0.042
## mmu04658 Th1 and Th2 cell differentiation 78 0.054
## mmu02010 ABC transporters 46 0.075
##
## $less
## p.geomean stat.mean p.val
## mmu04613 Neutrophil extracellular trap formation 0.0043 -2.6 0.0043
## mmu04110 Cell cycle 0.0062 -2.5 0.0062
## mmu04657 IL-17 signaling pathway 0.0113 -2.3 0.0113
## mmu04145 Phagosome 0.0332 -1.8 0.0332
## mmu04621 NOD-like receptor signaling pathway 0.0388 -1.8 0.0388
## mmu04625 C-type lectin receptor signaling pathway 0.0441 -1.7 0.0441
## q.val set.size exp1
## mmu04613 Neutrophil extracellular trap formation 0.75 160 0.0043
## mmu04110 Cell cycle 0.75 151 0.0062
## mmu04657 IL-17 signaling pathway 0.90 75 0.0113
## mmu04145 Phagosome 0.91 144 0.0332
## mmu04621 NOD-like receptor signaling pathway 0.91 154 0.0388
## mmu04625 C-type lectin receptor signaling pathway 0.91 99 0.0441
##
## $stats
## stat.mean exp1
## mmu03010 Ribosome 2.1 2.1
## mmu04022 cGMP-PKG signaling pathway 1.9 1.9
## mmu04360 Axon guidance 1.8 1.8
## mmu04330 Notch signaling pathway 1.8 1.8
## mmu04658 Th1 and Th2 cell differentiation 1.6 1.6
## mmu02010 ABC transporters 1.5 1.5
save greater and less than pathways
greaters <- keggres$greater
lessers <- keggres$less
keggrespathways <- data.frame(id = rownames(keggres$greater), keggres$greater) %>%
tbl_df() %>%
filter(row_number() <= 3) %>%
.$id %>%
as.character()
## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## ℹ Please use `tibble::as_tibble()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
keggrespathways
## [1] "mmu03010 Ribosome" "mmu04022 cGMP-PKG signaling pathway"
## [3] "mmu04360 Axon guidance"
keggresids <- substr(keggrespathways, start = 1, stop = 8)
keggresids
## [1] "mmu03010" "mmu04022" "mmu04360"
PLOT!
plot_pathway = function(pid) pathviiew(gene.data = foldchange, pathway.id = pid, specied = "mouse", new.signature = FALSE)
tmp = sapply(keggresids, function(pid) pathview(gene.data = foldchanges, pathway.id = pid, speceis = "mouse"))
## Info: Downloading xml files for hsammu03010, 1/1 pathways..
## Warning in download.file(xml.url, xml.target, quiet = T): cannot open URL
## 'https://rest.kegg.jp/get/hsammu03010/kgml': HTTP status was '404 Not Found'
## Warning: Download of hsammu03010 xml file failed!
## This pathway may not exist!
## Info: Downloading png files for hsammu03010, 1/1 pathways..
## Warning: Download of hsammu03010 png file failed!
## This pathway may not exist!
## Warning: Failed to download KEGG xml/png files, hsammu03010 skipped!
## Info: Downloading xml files for hsammu04022, 1/1 pathways..
## Warning in download.file(xml.url, xml.target, quiet = T): cannot open URL
## 'https://rest.kegg.jp/get/hsammu04022/kgml': HTTP status was '404 Not Found'
## Warning: Download of hsammu04022 xml file failed!
## This pathway may not exist!
## Info: Downloading png files for hsammu04022, 1/1 pathways..
## Warning: Download of hsammu04022 png file failed!
## This pathway may not exist!
## Warning: Failed to download KEGG xml/png files, hsammu04022 skipped!
## Info: Downloading xml files for hsammu04360, 1/1 pathways..
## Warning in download.file(xml.url, xml.target, quiet = T): cannot open URL
## 'https://rest.kegg.jp/get/hsammu04360/kgml': HTTP status was '404 Not Found'
## Warning: Download of hsammu04360 xml file failed!
## This pathway may not exist!
## Info: Downloading png files for hsammu04360, 1/1 pathways..
## Warning: Download of hsammu04360 png file failed!
## This pathway may not exist!
## Warning: Failed to download KEGG xml/png files, hsammu04360 skipped!
keggrespathways <- data.frame(id = rownames(keggres$less), keggres$less) %>%
tbl_df() %>%
filter(row_number() <= 3) %>%
.$id %>%
as.character()
## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## ℹ Please use `tibble::as_tibble()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
keggrespathways
## [1] "mmu04613 Neutrophil extracellular trap formation"
## [2] "mmu04110 Cell cycle"
## [3] "mmu04657 IL-17 signaling pathway"
keggresids <- substr(keggrespathways, start = 1, stop = 8)
keggresids
## [1] "mmu04613" "mmu04110" "mmu04657"
PLOT!
plot_pathway = function(pid) pathviiew(gene.data = foldchange, pathway.id = pid, specied = "mouse", new.signature = FALSE)
tmp = sapply(keggresids, function(pid) pathview(gene.data = foldchanges, pathway.id = pid, speceis = "mouse"))
## Info: Downloading xml files for hsammu04613, 1/1 pathways..
## Warning in download.file(xml.url, xml.target, quiet = T): cannot open URL
## 'https://rest.kegg.jp/get/hsammu04613/kgml': HTTP status was '404 Not Found'
## Warning: Download of hsammu04613 xml file failed!
## This pathway may not exist!
## Info: Downloading png files for hsammu04613, 1/1 pathways..
## Warning: Download of hsammu04613 png file failed!
## This pathway may not exist!
## Warning: Failed to download KEGG xml/png files, hsammu04613 skipped!
## Info: Downloading xml files for hsammu04110, 1/1 pathways..
## Warning in download.file(xml.url, xml.target, quiet = T): cannot open URL
## 'https://rest.kegg.jp/get/hsammu04110/kgml': HTTP status was '404 Not Found'
## Warning: Download of hsammu04110 xml file failed!
## This pathway may not exist!
## Info: Downloading png files for hsammu04110, 1/1 pathways..
## Warning: Download of hsammu04110 png file failed!
## This pathway may not exist!
## Warning: Failed to download KEGG xml/png files, hsammu04110 skipped!
## Info: Downloading xml files for hsammu04657, 1/1 pathways..
## Warning in download.file(xml.url, xml.target, quiet = T): cannot open URL
## 'https://rest.kegg.jp/get/hsammu04657/kgml': HTTP status was '404 Not Found'
## Warning: Download of hsammu04657 xml file failed!
## This pathway may not exist!
## Info: Downloading png files for hsammu04657, 1/1 pathways..
## Warning: Download of hsammu04657 png file failed!
## This pathway may not exist!
## Warning: Failed to download KEGG xml/png files, hsammu04657 skipped!
library(imager)
filenames <- list.files(path = "~/Desktop/classroom/myfiles/meganbarr789", pattern = ".*pathview.png")
all_images <- lapply(filenames, load.image)
knitr::include_graphics(filenames)
library(tidyverse)
EdgeR <- read.csv("Mouse_EdgeR_Results_Zero_vs_1")
DESeq <- read.csv("Mouse_DESeq.csv")
DESeq2 <- DESeq %>%
filter(padj < 0.05)
DESeq2 <- DESeq2[,c(1,3)]
EdgeR <- EdgeR[, 1:2]
colnames(DESeq2) <- c("ID", "logFC" )
colnames(EdgeR) <- c("ID", "logFC")
install.packages("GOplot")
## Installing package into '/home/student/R/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
library(GOplot)
## Loading required package: ggdendro
##
## Attaching package: 'ggdendro'
## The following object is masked from 'package:imager':
##
## label
## Loading required package: gridExtra
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:Biobase':
##
## combine
## The following object is masked from 'package:BiocGenerics':
##
## combine
## The following object is masked from 'package:dplyr':
##
## combine
## Loading required package: RColorBrewer
comp <- GOVenn(DESeq2, EdgeR, label = c("DESeq1", "EdgeR"), title = "comparison of DESeq and EdgeR DE Genes", plot = FALSE)
comp$plot
library(msa)
seq <- readAAStringSet("hglobin.fa")
alignments <- msa(seq, method = "ClustalW")
## use default substitution matrix
alignments
## CLUSTAL 2.1
##
## Call:
## msa(seq, method = "ClustalW")
##
## MsaAAMultipleAlignment with 3 rows and 142 columns
## aln names
## [1] MVLSPADKTNVKAAWGKVGAHAGEYG...FTPAVHASLDKFLASVSTVLTSKYR HBA_HUMAN
## [2] MVLSGEDKSNIKAAWGKIGGHGAEYG...FTPAVHASLDKFLASVSTVLTSKYR HBA_MOUSE
## [3] MSLTRTERTIILSLWSKISTQADVIG...FTADAHAAWDKFLSIVSGVLTEKYR HBAZ_CAPHI
## Con MVLS??DKTNIKAAWGKIG?HA?EYG...FTPAVHASLDKFLASVSTVLTSKYR Consensus
msaPrettyPrint(alignments, output = "pdf", showNames = "left",
showLogo = "none", askForOverwrite = FALSE,
verbose = TRUE, file = "whole_aligh.pdf")
## Multiple alignment written to temporary file /tmp/RtmpBHBxcE/seqa907e627683.fasta
## File whole_aligh.tex created
## Warning in texi2dvi(texfile, quiet = !verbose, pdf = identical(output, "pdf"),
## : texi2dvi script/program not available, using emulation
## Output file whole_aligh.pdf created
msaPrettyPrint(alignments, c(10,30), output = "pdf", showNames = "left",
file = "Zoomed_align.pdf", showLogo = "top", askForOverwrite = FALSE,
verbose = TRUE)
## Multiple alignment written to temporary file /tmp/RtmpBHBxcE/seqa907c2695c7.fasta
## File Zoomed_align.tex created
## Warning in texi2dvi(texfile, quiet = !verbose, pdf = identical(output, "pdf"),
## : texi2dvi script/program not available, using emulation
## Output file Zoomed_align.pdf created
Lets make a tree from out alignment
library(ape)
##
## Attaching package: 'ape'
## The following object is masked from 'package:imager':
##
## where
## The following object is masked from 'package:Biostrings':
##
## complement
## The following object is masked from 'package:dplyr':
##
## where
library(seqinr)
##
## Attaching package: 'seqinr'
## The following objects are masked from 'package:ape':
##
## as.alignment, consensus
## The following object is masked from 'package:matrixStats':
##
## count
## The following object is masked from 'package:Biostrings':
##
## translate
## The following object is masked from 'package:limma':
##
## zscore
## The following object is masked from 'package:dplyr':
##
## count
alignment_seqinr <- msaConvert(alignments, type = "seqinr::alignment")
distances1 <- seqinr::dist.alignment(alignment_seqinr, "identity")
tree <- ape::nj(distances1)
plot(tree,main = "Phylogenetic Tree of HBA sequences")
library(DECIPHER)
## Loading required package: RSQLite
## Loading required package: parallel
long_seq <- readDNAStringSet("plastid_genomes.fa")
long_seq
## DNAStringSet object of length 5:
## width seq names
## [1] 130584 GGCATAAGCTATCTTCCCAAAGG...ATGATTCAAACATAAAAGTCCT NC_018523.1 Sacch...
## [2] 161592 ATGGGCGAACGACGGGAATTGAA...AAAGAAAAAAAAATAGGAGTAA NC_022431.1 Ascle...
## [3] 117672 ATGAGTACAACTCGAAAGTCCAC...TTGATTTCATCCACAAACGAAC NC_022259.1 Nanno...
## [4] 154731 TTATCCATTTGTAGATGGAACTT...CATATACACTAAGACAAAAGTC NC_022417.1 Cocos...
## [5] 156618 GGGCGAACGACGGGAATTGAACC...CCTTTTGTAGCGAATCCGTTAT NC_022459.1 Camel...
Temporary SQLite database
Seqs2DB(long_seq, "XStringSet", "long_db", names(long_seq))
## Adding 5 sequences to the database.
##
## Added 5 new sequences to table Seqs.
## 65 total sequences in table Seqs.
## Time difference of 0.21 secs
synteny <- FindSynteny("long_db")
## ================================================================================
##
## Time difference of 9.8 secs
View blocs with plotting
pairs(synteny)
plot(synteny)
alignment file
alignment <- AlignSynteny(synteny, "long_db")
## ================================================================================
##
## Time difference of 88 secs
block <- unlist(alignment[[1]])
writeXStringSet(block, "genome_blocks_out.fa")
library(locfit)
## locfit 1.5-9.8 2023-06-11
##
## Attaching package: 'locfit'
## The following object is masked from 'package:purrr':
##
## none
library(Rsamtools)
get_annotated_regions_from_gff <- function(file_name) {
gff <- rtracklayer::import.gff(file_name)
as(gff, "GRanges")
}
whole_genome <- csaw::windowCounts(
# file.path(~Desktop/classroom/myfiles)
file.path( "windows.bam"),
bin = TRUE,
filter = 0,
width = 500,
param = csaw::readParam(
minq = 20, # determines what is a passing read
dedup = TRUE, # removes pcr duplicates
pe = "both" # requires that both pairs of reads are aligned
)
)
colnames(whole_genome) <- c("small_data")
annotated_regions <- get_annotated_regions_from_gff(("genes.gff"))
library(ggplot2)
library(ggtree)
## ggtree v3.10.0 For help: https://yulab-smu.top/treedata-book/
##
## If you use the ggtree package suite in published research, please cite
## the appropriate paper(s):
##
## Guangchuang Yu, David Smith, Huachen Zhu, Yi Guan, Tommy Tsan-Yuk Lam.
## ggtree: an R package for visualization and annotation of phylogenetic
## trees with their covariates and other associated data. Methods in
## Ecology and Evolution. 2017, 8(1):28-36. doi:10.1111/2041-210X.12628
##
## S Xu, Z Dai, P Guo, X Fu, S Liu, L Zhou, W Tang, T Feng, M Chen, L
## Zhan, T Wu, E Hu, Y Jiang, X Bo, G Yu. ggtreeExtra: Compact
## visualization of richly annotated phylogenetic data. Molecular Biology
## and Evolution. 2021, 38(9):4039-4042. doi: 10.1093/molbev/msab166
##
## G Yu. Data Integration, Manipulation and Visualization of Phylogenetic
## Trees (1st ed.). Chapman and Hall/CRC. 2022. ISBN: 9781032233574
##
## Attaching package: 'ggtree'
## The following object is masked from 'package:ape':
##
## rotate
## The following object is masked from 'package:magrittr':
##
## inset
## The following object is masked from 'package:reshape':
##
## expand
## The following object is masked from 'package:Biostrings':
##
## collapse
## The following object is masked from 'package:IRanges':
##
## collapse
## The following object is masked from 'package:S4Vectors':
##
## expand
## The following object is masked from 'package:tidyr':
##
## expand
library(treeio)
## treeio v1.26.0 For help: https://yulab-smu.top/treedata-book/
##
## If you use the ggtree package suite in published research, please cite
## the appropriate paper(s):
##
## LG Wang, TTY Lam, S Xu, Z Dai, L Zhou, T Feng, P Guo, CW Dunn, BR
## Jones, T Bradley, H Zhu, Y Guan, Y Jiang, G Yu. treeio: an R package
## for phylogenetic tree input and output with richly annotated and
## associated data. Molecular Biology and Evolution. 2020, 37(2):599-603.
## doi: 10.1093/molbev/msz240
##
## Guangchuang Yu. Using ggtree to visualize data on tree-like structures.
## Current Protocols in Bioinformatics. 2020, 69:e96. doi:10.1002/cpbi.96
##
## Guangchuang Yu. Data Integration, Manipulation and Visualization of
## Phylogenetic Trees (1st edition). Chapman and Hall/CRC. 2022,
## doi:10.1201/9781003279242
##
## Attaching package: 'treeio'
## The following object is masked from 'package:seqinr':
##
## read.fasta
## The following object is masked from 'package:Biostrings':
##
## mask
itol <- ape::read.tree("itol.nwk")
ggtree(itol)
ggtree(itol, layout = "circular")
ggtree(itol) + coord_flip() + scale_x_reverse()
ggtree(itol) + geom_tiplab(color = "indianred", size = 1.5)
ggtree(itol, layour = "unrooted") + geom_strip(13, 14, color = "red", barsize = 1)
## Warning in stat_tree(data = data, mapping = mapping, geom = "segment", position = position, : Ignoring unknown parameters: `layour`
## Ignoring unknown parameters: `layour`
ggtree(itol, layout = "unrooted") + geom_hilight(node = 11, type = "encircle", fill = "steelblue")
## "daylight" method was used as default layout for unrooted tree.
## Average angle change [1] 0.174910612627282
## Average angle change [2] 0.161645191380673
## Average angle change [3] 0.129304375923319
## Average angle change [4] 0.0825706767962636
## Average angle change [5] 0.100056259084497
install.packages('BAMMtools')
## Installing package into '/home/student/R/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
library(BAMMtools)
getmrca(itol, "Photorhabdus_luminescens", "Blochmannia_floridanus")
## [1] 206
ggtree(itol, layout = "unrooted") + geom_hilight(node = 206, type = "encircle", fill = "steelblue")
## "daylight" method was used as default layout for unrooted tree.
## Average angle change [1] 0.174910612627282
## Average angle change [2] 0.161645191380673
## Average angle change [3] 0.129304375923319
## Average angle change [4] 0.0825706767962636
## Average angle change [5] 0.100056259084497
library(ape)
library(adegraphics)
## Registered S3 methods overwritten by 'adegraphics':
## method from
## biplot.dudi ade4
## kplot.foucart ade4
## kplot.mcoa ade4
## kplot.mfa ade4
## kplot.pta ade4
## kplot.sepan ade4
## kplot.statis ade4
## scatter.coa ade4
## scatter.dudi ade4
## scatter.nipals ade4
## scatter.pco ade4
## score.acm ade4
## score.mix ade4
## score.pca ade4
## screeplot.dudi ade4
##
## Attaching package: 'adegraphics'
## The following object is masked from 'package:ape':
##
## zoom
## The following object is masked from 'package:GenomicRanges':
##
## score
## The following object is masked from 'package:Biostrings':
##
## score
## The following object is masked from 'package:BiocGenerics':
##
## score
library(treespace)
## Loading required package: ade4
##
## Attaching package: 'ade4'
## The following objects are masked from 'package:adegraphics':
##
## kplotsepan.coa, s.arrow, s.class, s.corcircle, s.distri, s.image,
## s.label, s.logo, s.match, s.traject, s.value, table.value,
## triangle.class
## The following object is masked from 'package:GenomicRanges':
##
## score
## The following object is masked from 'package:Biostrings':
##
## score
## The following object is masked from 'package:BiocGenerics':
##
## score
treefiles <- list.files(file.path(getwd(), "gene_trees"), full.names = TRUE)
tree_list <- lapply(treefiles, read.tree)
class(tree_list)
## [1] "list"
class(tree_list) <- "multiPhylo"
class(tree_list)
## [1] "multiPhylo"
comparisons <- treespace(tree_list, nf = 3)
table.image(comparisons$D, nclass= 25)
plotGroves(comparisons$pco, lab.show = TRUE, lab.cex = 1.5)
groves <- findGroves(comparisons, nclust = 4)
plotGroves(groves)
library(ape)
newick <- read.tree("mammal_tree.nwk")
l <- subtrees(newick)
plot(newick)
plot(l[[4]], sub = "Node 4")
small_tree <- extract.clade(newick, 9)
plot(small_tree)
new_tree <- bind.tree(newick, small_tree, 3)
plot(new_tree)
library(Biostrings)
library(msa)
library(phangorn)
seqs <- readAAStringSet("abc.fa")
aln <- msa::msa(seqs, method = c("ClustalOmega"))
## using Gonnet
aln <- as.phyDat(aln, type = "AA")
class(aln)
## [1] "phyDat"
dist_mat <- dist.ml(aln)
upgma_tree <- upgma(dist_mat)
plot(upgma_tree, mian = "UPGMA")
## Warning in plot.window(...): "mian" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "mian" is not a graphical parameter
## Warning in title(...): "mian" is not a graphical parameter
nj_tree <- NJ(dist_mat)
plot(nj_tree, main = "NJ")
fit <- pml(nj_tree, aln)
bootstraps <- bootstrap.phyDat(aln, FUN = function(x) (NJ(dist.ml(x))), bs = 100)
plotBS(nj_tree, bootstraps, p =10)
library(GenomicRanges)
library(gmapR)
library(VariantAnnotation)
##
## Attaching package: 'VariantAnnotation'
## The following object is masked from 'package:stringr':
##
## fixed
## The following object is masked from 'package:base':
##
## tabulate
library(VariantTools)
bam_file <- file.path(getwd(), "hg17_snps.bam")
fasta_file <- file.path(getwd(), "chr17_83k.fa")
fa <- rtracklayer::FastaFile(fasta_file)
genome <- gmapR::GmapGenome(fa, create = TRUE)
## Creating directory /home/student/.local/share/gmap
qual_params <- TallyVariantsParam(
genome = genome,
minimum_mapq = 20)
var_params <- VariantCallingFilters(read.count = 19, p.lower = 0.01)
called_variants <- callVariants(bam_file,
qual_params,
calling.filter = var_params)
head(called_variants)
## VRanges object with 6 ranges and 17 metadata columns:
## seqnames ranges strand ref alt totalDepth
## <Rle> <IRanges> <Rle> <character> <characterOrRle> <integerOrRle>
## [1] NC_000017.10 64 * G T 759
## [2] NC_000017.10 69 * G T 812
## [3] NC_000017.10 70 * G T 818
## [4] NC_000017.10 73 * T A 814
## [5] NC_000017.10 77 * T A 802
## [6] NC_000017.10 78 * G T 798
## refDepth altDepth sampleNames softFilterMatrix | n.read.pos
## <integerOrRle> <integerOrRle> <factorOrRle> <matrix> | <integer>
## [1] 739 20 <NA> | 17
## [2] 790 22 <NA> | 19
## [3] 796 22 <NA> | 20
## [4] 795 19 <NA> | 13
## [5] 780 22 <NA> | 19
## [6] 777 21 <NA> | 17
## n.read.pos.ref raw.count.total count.plus count.plus.ref count.minus
## <integer> <integer> <integer> <integer> <integer>
## [1] 64 759 20 739 0
## [2] 69 812 22 790 0
## [3] 70 818 22 796 0
## [4] 70 814 19 795 0
## [5] 70 802 22 780 0
## [6] 70 798 21 777 0
## count.minus.ref count.del.plus count.del.minus read.pos.mean
## <integer> <integer> <integer> <numeric>
## [1] 0 0 0 30.9000
## [2] 0 0 0 40.7273
## [3] 0 0 0 34.7727
## [4] 0 0 0 36.1579
## [5] 0 0 0 38.3636
## [6] 0 0 0 39.7143
## read.pos.mean.ref read.pos.var read.pos.var.ref mdfne mdfne.ref
## <numeric> <numeric> <numeric> <numeric> <numeric>
## [1] 32.8755 318.558 347.804 NA NA
## [2] 35.4190 377.004 398.876 NA NA
## [3] 36.3442 497.762 402.360 NA NA
## [4] 36.2176 519.551 402.843 NA NA
## [5] 36.0064 472.327 397.070 NA NA
## [6] 35.9241 609.076 390.463 NA NA
## count.high.nm count.high.nm.ref
## <integer> <integer>
## [1] 20 738
## [2] 22 789
## [3] 22 796
## [4] 19 769
## [5] 22 780
## [6] 21 777
## -------
## seqinfo: 1 sequence from chr17_83k genome
## hardFilters(4): nonRef nonNRef readCount likelihoodRatio
get_annotated_regions_from_gff <- function(file_name) {
gff <- rtracklayer::import.gff(file_name)
as(gff,"GRanges")
}
genes <- get_annotated_regions_from_gff("chr17.83k.gff3")
overlaps <- GenomicRanges::findOverlaps(called_variants, genes)
overlaps
## Hits object with 12684 hits and 0 metadata columns:
## queryHits subjectHits
## <integer> <integer>
## [1] 35176 1
## [2] 35176 2
## [3] 35176 3
## [4] 35177 1
## [5] 35177 2
## ... ... ...
## [12680] 40944 2
## [12681] 40944 7
## [12682] 40945 1
## [12683] 40945 2
## [12684] 40945 7
## -------
## queryLength: 44949 / subjectLength: 8
identified <- genes[subjectHits(overlaps)]
library(Biostrings)
library(systemPipeR)
## Loading required package: ShortRead
## Loading required package: BiocParallel
## Loading required package: GenomicAlignments
##
## Attaching package: 'GenomicAlignments'
## The following object is masked from 'package:dplyr':
##
## last
##
## Attaching package: 'ShortRead'
## The following object is masked from 'package:adegraphics':
##
## zoom
## The following objects are masked from 'package:locfit':
##
## left, right
## The following object is masked from 'package:ape':
##
## zoom
## The following object is masked from 'package:imager':
##
## clean
## The following object is masked from 'package:magrittr':
##
## functions
## The following object is masked from 'package:oligo':
##
## intensity
## The following objects are masked from 'package:oligoClasses':
##
## chromosome, position
## The following object is masked from 'package:affy':
##
## intensity
## The following object is masked from 'package:dplyr':
##
## id
## The following object is masked from 'package:purrr':
##
## compose
## The following object is masked from 'package:tibble':
##
## view
##
## Attaching package: 'systemPipeR'
## The following object is masked from 'package:VariantAnnotation':
##
## reference
## The following object is masked from 'package:DESeq2':
##
## results
dna_object <- readDNAStringSet("arabidopsis_chloroplast.fa")
predict_orfs <- predORF(dna_object, n = 'all', type = 'gr', mode = 'ORF', strand = 'both',
longest_disjoint = TRUE)
predict_orfs
## GRanges object with 2501 ranges and 2 metadata columns:
## seqnames ranges strand | subject_id inframe2end
## <Rle> <IRanges> <Rle> | <integer> <numeric>
## 1 chloroplast 86762-93358 + | 1 2
## 1162 chloroplast 2056-2532 - | 1 3
## 2 chloroplast 72371-73897 + | 2 2
## 1163 chloroplast 77901-78362 - | 2 1
## 3 chloroplast 54937-56397 + | 3 3
## ... ... ... ... . ... ...
## 2497 chloroplast 129757-129762 - | 1336 3
## 2498 chloroplast 139258-139263 - | 1337 3
## 2499 chloroplast 140026-140031 - | 1338 3
## 2500 chloroplast 143947-143952 - | 1339 3
## 2501 chloroplast 153619-153624 - | 1340 3
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
bases <- c("A", "T", "G", "C")
raw_seq_string <- strsplit(as.character(dna_object), "")
seq_length <- width(dna_object[1])
counts <- lapply(bases, function(x) {sum(grepl(x, raw_seq_string))})
probs <- unlist(lapply(counts, function(base_count){signif(base_count/seq_length,2)}))
probs
## [1] 6.5e-06 6.5e-06 6.5e-06 6.5e-06
get_longest_orf_in_random_genome <- function(x,
length = 1000,
probs = c(0.25, 0.25, 0.25, 0.25),
bases = c("A", "T", "G", "C")){
random_genome <- paste0(sample(bases, size = length, replace = TRUE, prob = probs), collapse = "")
random_dna_object <- DNAStringSet(random_genome)
names(random_dna_object) <-("random_dna_string")
orfs <- predORF(random_dna_object, n =1, type = 'gr', mode = 'ORF', strand = 'both', longest_disjoint = TRUE)
return(max(width(orfs)))
}
Use the function just created to predict the ORFs in 10 rando genomes
random_lengths <- unlist(lapply(1:10, get_longest_orf_in_random_genome, length = seq_length, probs = probs, bases = bases))
Pull out the longest length from 10 simulations
longest_random_orf <- max(random_lengths)
only keep the frames that are longer in chloroplast genome
keep <- width(predict_orfs) > longest_random_orf
orfs_to_keep <- predict_orfs[keep]
orfs_to_keep
## GRanges object with 9 ranges and 2 metadata columns:
## seqnames ranges strand | subject_id inframe2end
## <Rle> <IRanges> <Rle> | <integer> <numeric>
## 1 chloroplast 86762-93358 + | 1 2
## 2 chloroplast 72371-73897 + | 2 2
## 3 chloroplast 54937-56397 + | 3 3
## 4 chloroplast 57147-58541 + | 4 1
## 5 chloroplast 33918-35141 + | 5 1
## 6 chloroplast 32693-33772 + | 6 2
## 7 chloroplast 109408-110436 + | 7 3
## 8 chloroplast 114461-115447 + | 8 2
## 9 chloroplast 141539-142276 + | 9 2
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
write this data to file
extracted_orfs <- BSgenome::getSeq(dna_object, orfs_to_keep)
names(extracted_orfs) <- paste0("orf_", 1:length(orfs_to_keep))
writeXStringSet(extracted_orfs, "saved_orfs.fa")
library(karyoploteR)
## Loading required package: regioneR
library(GenomicRanges)
setwd("~/Desktop/classroom/myfiles/meganbarr789")
genome_df <- data.frame(
chr = paste0("chr", 1:5),
start = rep(1, 5),
end = c(34964571, 22037565, 25499034, 20862711, 31270811)
)
Convert the dataframe to a granges object
genome_gr <- makeGRangesFromDataFrame(genome_df)
Create soe snp positions to map out.
snp_pos <- sample(1:1e7, 25)
snps <- data.frame(
chr = paste0("chr", sample(1:5,25, replace = TRUE)),
start = snp_pos,
end = snp_pos
)
convert the dataframe to granges
snps_gr <- makeGRangesFromDataFrame(snps)
create snp labels
snp_labels <- paste0("snp_", 1:25)
set margins for the plot
plot.params <- getDefaultPlotParams(plot.type = 1)
plot.params$dataloutmargin <- 600
Plot the snps
kp <- plotKaryotype(genome= genome_gr, plot.type = 1, plot.params = plot.params)
## No predefined canonical chromosomes found for the requested genome. Applying a heuristic chromosome filtering.
## To get the unfiltered genome, please set chromosomes="all" in the plotKaryotype call
kpPlotMarkers(kp, snps_gr, labels = snp_labels)
Add numeric data to the plot plot to 100 windows on chromosome 4
numeric_data <- data.frame(
y = rnorm(100, mean = 1, sd = 0.5),
chr = rep("chr4", 100),
start = seq(1,20862711, 20862711/100),
end = seq(1,20682711, 20862711/100)
)
make the data a granges object
numeric_data_gr <- makeGRangesFromDataFrame(numeric_data)
plot.params <- getDefaultPlotParams(plot.type = 2)
plot.params$data1outmargin <- 800
plot.params$data2outmargin <- 800
plot.params$topmargin <- 800
kp <- plotKaryotype(genome = genome_gr, plot.type = 1, plot.params = plot.params)
## No predefined canonical chromosomes found for the requested genome. Applying a heuristic chromosome filtering.
## To get the unfiltered genome, please set chromosomes="all" in the plotKaryotype call
kpPlotMarkers(kp, snps_gr, labels = snp_labels)
kpLines(kp, numeric_data_gr, y = numeric_data$y, data.panel = 2)