R Markdown Tutorial

this is a tutorial on how to use R markdown for reproducible research.

Here we can type long passages or descriptions of our data without the need of “hashtaging” our comments with the # symbol. In our first example we will be using the ToothGrowth dataset. In this experiemnt Guinea pigs were given different amounts of vitaimin C to see the affects on the animals teeth growth

To run R code in a markdown file, we need to denote the section that is considered R code. We call these “code chunks”

Below is a code chunk:

Toothdata <- ToothGrowth

head(Toothdata)

##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

As you can see, from running the “play” button results in the code chunk printed inline of the r markdown file.

fit <- lm(len ~ dose, data = Toothdata)

b <- fit$coefficients

plot(len ~ dose, data = Toothdata)

abline(lm(len ~ dose, data = Toothdata))

The Tooth Growth of Guniea Pigs When Given Variable Ammounts of Vitamins

the slope of the regression line is 9.7635714,

section headers

we can also put sections and subsections in our r markdown file, similar to numbers or bullet points in a word document. This is done with the “#” that we previously used to denote text in an R script.

First level header

second level header

third level header

make sure you put a space after the hashtag, otherwise it will not work!

we can also add bullet points-type marks in our r markdown file.

one item
one item
one item
- oone more item
- one more item
- one more item
  - one last indentation

its important to note here that r markdown indentation matters!

first item
second item
third item

subitem 1
subitem 2
subitem 3

block quotes

we can put really nice quootes into the markdown document. We do this by using the “>” symbool.

“Genes are like the story, and DNA is the language that the story is written in.”

— Sam Kean

“If you ain’t first you’re last”

— Ricky Bobby

hyperlinks

Hyperlinks can also be incorperated into these files. This is especially useful in HTML files, since they are in a web browser anf will redirect the reader to the material that you are interested in showing the. Here we will use the link to R markdown’s homepage for this example. RMarkdown

Formulas

We can also put nice formated formulas into Markdown using two dollar signs.

Hard-Weinberg Formula

\[p^2 + 2pq + q^2 = 1\] and you can get really complex as well

\[\Theta = \begin{pmatrix}\alpha & \beta\\ \gamma & \delta \end{pmatrix}\]

Code Chunks

Code Chunk Options

There are alsso options for your R markdown file on how Knitr interprits the code chunk. There are the following options.

Eval (T or F): whether or not to evaluate the code chunk

Echo (T or F): whether or not to show the code for the chunk, but results will still print.

cache: if you enable, the same code chunk will not be evaluated the next time that the knitr is run. Great for code that has LONG run times.

fig.width or fig.height: the (graphical device) size of the R plots in inches. The figures are first written to the knitr document then to files that are saved separately.

out.width or out.height: The output size of the R plots IN THE DOCUMENT.

fig.cap: the words for the figure caption

we can also add a table of contents to out HTML Document. We do this by altering the YAML code ( the weird code chunk at the very top of the document.) We can add this: title: “Final Project BISC 516” author: “Megan Barr” date: “2024-11-11” output: html_document: toc: true toc_float: true

This will give us a floating table of contents on the right hand side of the document.

Tabs

You can also add TABS in our report. To do this you need to specify each section that you want to become a tab by placing “{.tabset}” after the line. Every subsequent header will be a new tab.

Themes

you can also add themes to your HTML document that change the highlighting color and hyperlink color of your HTML output. This can be nice aestheticlly To do this, you can change yoour theme in te YAML to one of the following.

Hyperlink color

Cerulean
Journal
Flatly
Readable
spacelab
united
cosmo
lumen
paper
sandstone
simplex
Yeti
null

You can alsso change the color by specifying highlights {.tabset}

Highlight color
defualt
tango
payments
kate
monochrome
espresso
zenburn
haddock
textnate

Data wrangling

Load libraries and data

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(nycflights13)

my_data <- nycflights13::flights

head(my_data)

## # A tibble: 6 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     1      517            515         2      830            819
## 2  2013     1     1      533            529         4      850            830
## 3  2013     1     1      542            540         2      923            850
## 4  2013     1     1      544            545        -1     1004           1022
## 5  2013     1     1      554            600        -6      812            837
## 6  2013     1     1      554            558        -4      740            728
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

first we will just look at the data on October 14th.

## # A tibble: 987 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    10    14      451            500        -9      624            648
##  2  2013    10    14      511            517        -6      733            757
##  3  2013    10    14      536            545        -9      814            855
##  4  2013    10    14      540            545        -5      932            933
##  5  2013    10    14      548            545         3      824            827
##  6  2013    10    14      549            600       -11      719            730
##  7  2013    10    14      552            600        -8      650            659
##  8  2013    10    14      553            600        -7      646            700
##  9  2013    10    14      554            600        -6      836            829
## 10  2013    10    14      555            600        -5      832            855
## # ℹ 977 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

If we want to subset this into a new variable, we do the following:

oct_14_flight <- filter(my_data, month == 10, day ==14)

what if you want to do both print and save the variable?

(oct_14_flight_2 <- filter(my_data, month == 10, day ==14))

## # A tibble: 987 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    10    14      451            500        -9      624            648
##  2  2013    10    14      511            517        -6      733            757
##  3  2013    10    14      536            545        -9      814            855
##  4  2013    10    14      540            545        -5      932            933
##  5  2013    10    14      548            545         3      824            827
##  6  2013    10    14      549            600       -11      719            730
##  7  2013    10    14      552            600        -8      650            659
##  8  2013    10    14      553            600        -7      646            700
##  9  2013    10    14      554            600        -6      836            829
## 10  2013    10    14      555            600        -5      832            855
## # ℹ 977 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

If you want to filter based on differnt opperators, you can use the following:

Equals ==
not equal to !=
greater than >
less than <
greater than or equal to >=
less than or equal to <+

(flight_through_september <- filter(my_data, month < 10))

## # A tibble: 252,484 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 252,474 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

If we don’t use the == to mean equals, we get this:

(oct_14_flight_2 <- filter(my_data, month = 10, day =14))

You can also use logical opperators to be more selective

and &
or |
not !

Let’s use the “or” functioon to pick flights in march and april

March_April_Flights <- filter(my_data, month == 3 | month ==4)

March_4th_Flights <- filter(my_data, month == 3 & day ==4)

Non_jan_flights <- filter(my_data, month != 1)

Arrange

Arrange allows us to arrange the dataset based on the variables we desire.

arrange(my_data, year, day, month)

## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

We can also do this in descending fashion

descending <- arrange(my_data, desc(year), desc(day), desc(month))

misssing values are always placed at the end of the dataframe regardless of ascending or descending.

Select

We can also select specfic columns that we wnant to looka at.

calander <- select(my_data, year, month, day)
print(calander)

## # A tibble: 336,776 × 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # ℹ 336,766 more rows

We can also look at a range of columns

calender2 <- select(my_data, year:day)

Let’s look at all columns months through carrier

calender3 <- select(my_data, year:carrier)

We can also chooe which NOT to include

everything_else <- select(my_data, -(year:day))

In this instance we can also use the “not” opperator !

everything_else_2 <- select(my_data, !(year:day))

There are also some other helper functions that can help you select the columns or data you’re looking for

starts_with(“xyz”) – will select the values that start with xyz
ends_with(“xyz”) – will select the values that end with xyz
contains(“xyz”) – will select the values that contain xyz
matches (“xyz”) – will match the identical value xyz

Renaming

head(my_data)

## # A tibble: 6 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     1      517            515         2      830            819
## 2  2013     1     1      533            529         4      850            830
## 3  2013     1     1      542            540         2      923            850
## 4  2013     1     1      544            545        -1     1004           1022
## 5  2013     1     1      554            600        -6      812            837
## 6  2013     1     1      554            558        -4      740            728
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

rename(my_data, departure_time = dep_time)

## # A tibble: 336,776 × 19
##     year month   day departure_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>          <int>          <int>     <dbl>    <int>
##  1  2013     1     1            517            515         2      830
##  2  2013     1     1            533            529         4      850
##  3  2013     1     1            542            540         2      923
##  4  2013     1     1            544            545        -1     1004
##  5  2013     1     1            554            600        -6      812
##  6  2013     1     1            554            558        -4      740
##  7  2013     1     1            555            600        -5      913
##  8  2013     1     1            557            600        -3      709
##  9  2013     1     1            557            600        -3      838
## 10  2013     1     1            558            600        -2      753
## # ℹ 336,766 more rows
## # ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

my_data <- rename(my_data, departure_time = dep_time)

Mutate

What if you want to add new columns to your data frame? we have the mutate() function for that.

First, lets make smaller data frame so we can see what we’re doing.

my_data_small <- select(my_data, year:day, distance, air_time)

Lets calculate the speed of the flights.

mutate(my_data_small, speed = distance / air_time * 60)

## # A tibble: 336,776 × 6
##     year month   day distance air_time speed
##    <int> <int> <int>    <dbl>    <dbl> <dbl>
##  1  2013     1     1     1400      227  370.
##  2  2013     1     1     1416      227  374.
##  3  2013     1     1     1089      160  408.
##  4  2013     1     1     1576      183  517.
##  5  2013     1     1      762      116  394.
##  6  2013     1     1      719      150  288.
##  7  2013     1     1     1065      158  404.
##  8  2013     1     1      229       53  259.
##  9  2013     1     1      944      140  405.
## 10  2013     1     1      733      138  319.
## # ℹ 336,766 more rows

my_data_small <- mutate(my_data_small, speed = distance / air_time * 60)

what if we wanted to create a new data frame with ONLY your calculations (transmute)

airspeed <- transmute(my_data_small, speed = distance / air_time * 60 , speed2 = distance / air_time)

Sunnarize and by_group()

we can use summarize to run a function on a data column to get a single return

summarize(my_data, delay = mean(dep_delay, na.rm = T))

## # A tibble: 1 × 1
##   delay
##   <dbl>
## 1  12.6

So we can see here the average delay is about 12 minutes

we gain additional value in summerize by pairing it with by_group()

by_day <- group_by(my_data, year, month, day)
summarize(by_day, delay = mean(dep_delay, na.rm = T))

## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.

## # A tibble: 365 × 4
## # Groups:   year, month [12]
##     year month   day delay
##    <int> <int> <int> <dbl>
##  1  2013     1     1 11.5 
##  2  2013     1     2 13.9 
##  3  2013     1     3 11.0 
##  4  2013     1     4  8.95
##  5  2013     1     5  5.73
##  6  2013     1     6  7.15
##  7  2013     1     7  5.42
##  8  2013     1     8  2.55
##  9  2013     1     9  2.28
## 10  2013     1    10  2.84
## # ℹ 355 more rows

As you can see, we now have the delay by the days of the year

Missing Data

What happens if we dont tell R what to do with the missing data?

summarize(by_day, delay = mean(dep_delay))

## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.

## # A tibble: 365 × 4
## # Groups:   year, month [12]
##     year month   day delay
##    <int> <int> <int> <dbl>
##  1  2013     1     1    NA
##  2  2013     1     2    NA
##  3  2013     1     3    NA
##  4  2013     1     4    NA
##  5  2013     1     5    NA
##  6  2013     1     6    NA
##  7  2013     1     7    NA
##  8  2013     1     8    NA
##  9  2013     1     9    NA
## 10  2013     1    10    NA
## # ℹ 355 more rows

We can also filter our data based on NA (which in this dataset was cancelled flights)

not_cancalled <- filter(my_data, !is.na(dep_delay), !is.na(arr_delay))

Lets run summarize again on this data

summarize(not_cancalled, delay = mean(dep_delay))

## # A tibble: 1 × 1
##   delay
##   <dbl>
## 1  12.6

Counts

We can also count the number of variables that are NA

sum(is.na(my_data$dep_delay))

## [1] 8255

We can also count the number that are NOT NA

sum(!is.na(my_data$dep_delay))

## [1] 328521

Piping

With tibble data sets (more on them soon), we can pipe results to get rid of the dollar signs We can use them to summarize the number of flights by minutes delayed.

my_data %>%
  group_by(year, month, day) %>%
  summarize(mean = mean(departure_time, na.rm = T))

## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.

## # A tibble: 365 × 4
## # Groups:   year, month [12]
##     year month   day  mean
##    <int> <int> <int> <dbl>
##  1  2013     1     1 1385.
##  2  2013     1     2 1354.
##  3  2013     1     3 1357.
##  4  2013     1     4 1348.
##  5  2013     1     5 1326.
##  6  2013     1     6 1399.
##  7  2013     1     7 1341.
##  8  2013     1     8 1335.
##  9  2013     1     9 1326.
## 10  2013     1    10 1333.
## # ℹ 355 more rows

Tibbles

load library

library(tibble)

Now we will take the time to explore tibbles. Tibbles are modified data frames which tweak some of the older Features from data frames. R is an old language, and useful things from 20 years ago are not as useful anymore.

as_tibble(iris)

## # A tibble: 150 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
## # ℹ 140 more rows

as we can see, we have the same data frame, but we have different features

you can also create a tibble from scratch with tibble()

tibble(
  x = 1:5,
  y = 1,
  z = x ^ 2 + y
)

## # A tibble: 5 × 3
##       x     y     z
##   <int> <dbl> <dbl>
## 1     1     1     2
## 2     2     1     5
## 3     3     1    10
## 4     4     1    17
## 5     5     1    26

you can also ues tribble() for basic data table creation

tribble(
  ~geneA, ~ geneB, ~ geneC,
  ##########################
  110,      112,      114,
  6,        5,        4
)

## # A tibble: 2 × 3
##   geneA geneB geneC
##   <dbl> <dbl> <dbl>
## 1   110   112   114
## 2     6     5     4

Tibbles are built to not overwhelm your console when printing data, only showing the first few lines

This is how a data frame prints

Subsetting

Subsetting tibbles is easy, similar to data.frames

df_tibble <- tibble(nycflights13::flights)

df_tibble

## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

We can subset by column name using the $

we can subset by position using [[]]

if you want to use this in a pipe, you need to use the “.” placeholder

some older functions do not like tibbles, thus might have to convert them back to dataframes

class(df_tibble)

## [1] "tbl_df"     "tbl"        "data.frame"

df_tibble_2 <- as.data.frame(df_tibble)

class(df_tibble_2)

## [1] "data.frame"

df_tibble

## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

head(df_tibble_2)

##   year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1 2013     1   1      517            515         2      830            819
## 2 2013     1   1      533            529         4      850            830
## 3 2013     1   1      542            540         2      923            850
## 4 2013     1   1      544            545        -1     1004           1022
## 5 2013     1   1      554            600        -6      812            837
## 6 2013     1   1      554            558        -4      740            728
##   arr_delay carrier flight tailnum origin dest air_time distance hour minute
## 1        11      UA   1545  N14228    EWR  IAH      227     1400    5     15
## 2        20      UA   1714  N24211    LGA  IAH      227     1416    5     29
## 3        33      AA   1141  N619AA    JFK  MIA      160     1089    5     40
## 4       -18      B6    725  N804JB    JFK  BQN      183     1576    5     45
## 5       -25      DL    461  N668DN    LGA  ATL      116      762    6      0
## 6        12      UA   1696  N39463    EWR  ORD      150      719    5     58
##             time_hour
## 1 2013-01-01 05:00:00
## 2 2013-01-01 05:00:00
## 3 2013-01-01 05:00:00
## 4 2013-01-01 05:00:00
## 5 2013-01-01 06:00:00
## 6 2013-01-01 05:00:00

tidyr

library(tidyverse)

how do we make a tidy dataset? well the tidyversse follows three rules.

each variable mist have its own column
each observation must have its own row
each value must have its own cell

Its impossible to satisfy two of the three rules.

This leads to the following instructions

put each dataset into a tibble
put each variable into a column
profit

picking one consistent method of data storage makes for easier understanding of your code an what is happening “under the hood” or behinde the scenes.

Lets now look at working with tibbles

library(tidyverse)
bmi <- tibble(women)

bmi %>%
  mutate(bmi = (703 * weight)/ (height)^2)

## # A tibble: 15 × 3
##    height weight   bmi
##     <dbl>  <dbl> <dbl>
##  1     58    115  24.0
##  2     59    117  23.6
##  3     60    120  23.4
##  4     61    123  23.2
##  5     62    126  23.0
##  6     63    129  22.8
##  7     64    132  22.7
##  8     65    135  22.5
##  9     66    139  22.4
## 10     67    142  22.2
## 11     68    146  22.2
## 12     69    150  22.1
## 13     70    154  22.1
## 14     71    159  22.2
## 15     72    164  22.2

Spreading and Gathering

Sometines you’ll find datasets that don’t fit well into a tibble We’ll use the built-in data from tidyverse for this part

table4a

## # A tibble: 3 × 3
##   country     `1999` `2000`
##   <chr>        <dbl>  <dbl>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

as you can see form this data, we have one variable in column A (country) But columns B and C are two of the same, thus there are two observations in each row

To fix this we can use the gather function

table4a %>% 
  gather('1999', '2000' , key = 'year' , value ='cases')

## # A tibble: 6 × 3
##   country     year   cases
##   <chr>       <chr>  <dbl>
## 1 Afghanistan 1999     745
## 2 Brazil      1999   37737
## 3 China       1999  212258
## 4 Afghanistan 2000    2666
## 5 Brazil      2000   80488
## 6 China       2000  213766

lets look at another example

table4b

## # A tibble: 3 × 3
##   country         `1999`     `2000`
##   <chr>            <dbl>      <dbl>
## 1 Afghanistan   19987071   20595360
## 2 Brazil       172006362  174504898
## 3 China       1272915272 1280428583

as you can see we have the same problem in table 4b

table4b %>%
  gather ("1999", "2000", key = "year", value = "population")

## # A tibble: 6 × 3
##   country     year  population
##   <chr>       <chr>      <dbl>
## 1 Afghanistan 1999    19987071
## 2 Brazil      1999   172006362
## 3 China       1999  1272915272
## 4 Afghanistan 2000    20595360
## 5 Brazil      2000   174504898
## 6 China       2000  1280428583

now what if we want to join these two tables? we can use dplyr

table4a <- table4a %>% 
  gather('1999', '2000' , key = 'year' , value ='cases')
table4b <- table4b %>%
  gather ("1999", "2000", key = "year", value = "population")

left_join(table4a, table4b)

## Joining with `by = join_by(country, year)`

## # A tibble: 6 × 4
##   country     year   cases population
##   <chr>       <chr>  <dbl>      <dbl>
## 1 Afghanistan 1999     745   19987071
## 2 Brazil      1999   37737  172006362
## 3 China       1999  212258 1272915272
## 4 Afghanistan 2000    2666   20595360
## 5 Brazil      2000   80488  174504898
## 6 China       2000  213766 1280428583

Spreading

spreading is the opposite of gathering. Lets look at table 2

table2

## # A tibble: 12 × 4
##    country      year type            count
##    <chr>       <dbl> <chr>           <dbl>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583

you can see that we have redundant info in columns 1 and 2 we can fix that by combining rows 1 and 2, 3 and 4, ect.

spread(table2, key = type, value = count)

## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <dbl>  <dbl>      <dbl>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

Type is the key of what we are turning into columns, the value is what becomes rows/observations

In summary spread makes long tables shorter and wideer gather makes wide tables narrower and longer

Separating and pull

Now what happens when we have two observations stuck in one column?

table3

## # A tibble: 6 × 3
##   country      year rate             
##   <chr>       <dbl> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583

As you can see, the rate iis just the population and cases combined. We can use seperate the fix this

table3%>%
  separate(rate, into = c("cases", "population"))

## # A tibble: 6 × 4
##   country      year cases  population
##   <chr>       <dbl> <chr>  <chr>     
## 1 Afghanistan  1999 745    19987071  
## 2 Afghanistan  2000 2666   20595360  
## 3 Brazil       1999 37737  172006362 
## 4 Brazil       2000 80488  174504898 
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

However, if you notice, the column type is not correct.

table3%>%
  separate(rate, into =c("cases", "populate"), conver = TRUE)

## # A tibble: 6 × 4
##   country      year  cases   populate
##   <chr>       <dbl>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

You can specify what you want to seperate based on.

table3%>%
  separate(rate, into =c("cases", "populate"), sep = "/", conver = TRUE)

## # A tibble: 6 × 4
##   country      year  cases   populate
##   <chr>       <dbl>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

Lets make this look more tidy

table3 %>%
  separate(
    year,
    into = c("centurey", "year"),
    convert = TRUE,
    sep = 2
  )

## # A tibble: 6 × 4
##   country     centurey  year rate             
##   <chr>          <int> <int> <chr>            
## 1 Afghanistan       19    99 745/19987071     
## 2 Afghanistan       20     0 2666/20595360    
## 3 Brazil            19    99 37737/172006362  
## 4 Brazil            20     0 80488/174504898  
## 5 China             19    99 212258/1272915272
## 6 China             20     0 213766/1280428583

Unite

What happens if we want to do the inverse of separate?

table5

## # A tibble: 6 × 4
##   country     century year  rate             
##   <chr>       <chr>   <chr> <chr>            
## 1 Afghanistan 19      99    745/19987071     
## 2 Afghanistan 20      00    2666/20595360    
## 3 Brazil      19      99    37737/172006362  
## 4 Brazil      20      00    80488/174504898  
## 5 China       19      99    212258/1272915272
## 6 China       20      00    213766/1280428583

table5 %>%
  unite(date, century, year, sep = "")

## # A tibble: 6 × 3
##   country     date  rate             
##   <chr>       <chr> <chr>            
## 1 Afghanistan 1999  745/19987071     
## 2 Afghanistan 2000  2666/20595360    
## 3 Brazil      1999  37737/172006362  
## 4 Brazil      2000  80488/174504898  
## 5 China       1999  212258/1272915272
## 6 China       2000  213766/1280428583

Missing values

There can be two types of missing values, NA (explicit) or just no entry (implicit)

gene_data <- tibble(
  gene = c('a', 'a', 'a', 'a', 'b', 'b', 'b'),
  nuc = c(20, 22, 24, 25, NA, 42, 67),
  run = c(1,2,3,4,2,3,4)
)

gene_data

## # A tibble: 7 × 3
##   gene    nuc   run
##   <chr> <dbl> <dbl>
## 1 a        20     1
## 2 a        22     2
## 3 a        24     3
## 4 a        25     4
## 5 b        NA     2
## 6 b        42     3
## 7 b        67     4

The nucelotide count for Gene B run 2 is explicit missing The nucleotide count for Gene B run 1 is implicitly missing

One way we can make implicit missing values explicit is by putting runs in columns

gene_data %>%
  spread(gene, nuc)

## # A tibble: 4 × 3
##     run     a     b
##   <dbl> <dbl> <dbl>
## 1     1    20    NA
## 2     2    22    NA
## 3     3    24    42
## 4     4    25    67

If we want to remove the missing values we can use spread and gather and na.rm =TRUE

gene_data %>%
  spread(gene, nuc) %>%
  gather(gene, nuc, 'a' : 'b', na.rm = TRUE)

## # A tibble: 6 × 3
##     run gene    nuc
##   <dbl> <chr> <dbl>
## 1     1 a        20
## 2     2 a        22
## 3     3 a        24
## 4     4 a        25
## 5     3 b        42
## 6     4 b        67

Another way we can make missing values implicit vs explicit, is complete()

gene_data %>%
  complete(gene, run)

## # A tibble: 8 × 3
##   gene    run   nuc
##   <chr> <dbl> <dbl>
## 1 a         1    20
## 2 a         2    22
## 3 a         3    24
## 4 a         4    25
## 5 b         1    NA
## 6 b         2    NA
## 7 b         3    42
## 8 b         4    67

Sometimes an NA is present to represent a value being varried forward

treatment <- tribble(
  ~person,            ~treatment,             ~response,
  ####################################################
  "issac",                1,                      7,
  NA,                     2,                      10,
  NA,                     3,                      9,
  "VDB",                  1,                      8,
  NA,                     2,                      11,
  NA,                     3,                      10
)

treatment

## # A tibble: 6 × 3
##   person treatment response
##   <chr>      <dbl>    <dbl>
## 1 issac          1        7
## 2 <NA>           2       10
## 3 <NA>           3        9
## 4 VDB            1        8
## 5 <NA>           2       11
## 6 <NA>           3       10

What we can do here is use the full() option

treatment %>%
  fill(person)

## # A tibble: 6 × 3
##   person treatment response
##   <chr>      <dbl>    <dbl>
## 1 issac          1        7
## 2 issac          2       10
## 3 issac          3        9
## 4 VDB            1        8
## 5 VDB            2       11
## 6 VDB            3       10

DPLYR

It is rare that you will be working with a simple data table. The DYPLR package allows you to join two data tabels based on common values.

Mutate joins - add new variables to one data frame from the matching observations in another
Filtering joins - filters observation from one data frame based n whether or not they present in another
set operations - treats observations as they are set elemetns

Load libraries

library(tidyverse)
library(nycflights13)

Lets pull full carrier names based on letter codes

airlines

## # A tibble: 16 × 2
##    carrier name                       
##    <chr>   <chr>                      
##  1 9E      Endeavor Air Inc.          
##  2 AA      American Airlines Inc.     
##  3 AS      Alaska Airlines Inc.       
##  4 B6      JetBlue Airways            
##  5 DL      Delta Air Lines Inc.       
##  6 EV      ExpressJet Airlines Inc.   
##  7 F9      Frontier Airlines Inc.     
##  8 FL      AirTran Airways Corporation
##  9 HA      Hawaiian Airlines Inc.     
## 10 MQ      Envoy Air                  
## 11 OO      SkyWest Airlines Inc.      
## 12 UA      United Air Lines Inc.      
## 13 US      US Airways Inc.            
## 14 VX      Virgin America             
## 15 WN      Southwest Airlines Co.     
## 16 YV      Mesa Airlines Inc.

lets get info about airports

Lets get info about each plane

Lets get some info on the weather at the airporots

lets get info on singular flights

Lets look at how these tables connect

flights –> planes based on tailnumber
Flights –> airlines through carrier
flights –> airports through orgin and destination
flights –> weather via orgin, year/month/day/hour

Keys

Keys are unique identifieers per observation primary key uniquley identifies an observation in its own table

one way to identify a primary key is as follows

planes %>%
  count(tailnum) %>%
  filter(n>1)

## # A tibble: 0 × 2
## # ℹ 2 variables: tailnum <chr>, n <int>

This indicated that the tailnumber is unique

planes %>%
  count(model) %>%
  filter(n>1)

## # A tibble: 79 × 2
##    model       n
##    <chr>   <int>
##  1 717-200    88
##  2 737-301     2
##  3 737-3G7     2
##  4 737-3H4   105
##  5 737-3K2     2
##  6 737-3L9     2
##  7 737-3Q8     5
##  8 737-3TO     2
##  9 737-401     4
## 10 737-4B7    18
## # ℹ 69 more rows

Mutate join

We have now added tthe airline name to our dataframe from the airline dataframe

other types of joins 1. inner jpins (inner_join) matches a pair of observations when their key is equal 2. outer joins (outer_joins) keeps observations that appear in at least one table.

Stringr

load libraries

library(tidyverse)
library(stringr)

you can create strings using single or double quotes

string1 <- "This is a string"
string2 <- 'to put a "quote" in your string, ue the opposite'

string1

## [1] "This is a string"

string2

## [1] "to put a \"quote\" in your string, ue the opposite"

if you forget to close your string, you’ll get this: - string3 <- “where iss this string going? need to use paratnthes to close string if it does happen hit escape and try again

string3 <- "where is this string going?"

string3

## [1] "where is this string going?"

multiple strings are stored in character vectors

string4 <- c("one", "two", "three")
string4

## [1] "one"   "two"   "three"

measuring string length

str_length(string3)

## [1] 27

str_length(string4)

## [1] 3 3 5

lets combine two strings

str_c("x", "y")

## [1] "xy"

str_c(string1, string2)

## [1] "This is a stringto put a \"quote\" in your string, ue the opposite"

you can use sep to control how they are seperated

str_c(string1, string2, ssep = " ")

## [1] "This is a stringto put a \"quote\" in your string, ue the opposite "

str_c("x", "y", "z", sep = "_")

## [1] "x_y_z"

Subsetting Strings

you can ssubset a string using str_sub()

HSP <- c("HSP123", "HSP234", "HSP456")

str_sub(HSP, 4,6)

## [1] "123" "234" "456"

this just drops the first four letters from the strings

or you can use negatives to count back from the end

str_sub(HSP, -3, -1)

## [1] "123" "234" "456"

you can convert the cases of strings like follows:

HSP

## [1] "HSP123" "HSP234" "HSP456"

str_to_lower(HSP)

## [1] "hsp123" "hsp234" "hsp456"

str_to_upper() - makes everything upper case instead of lower case

Regular Expression

install.packages("htmlwidgets")

## Installing package into '/home/student/R/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)

x <- c('ATTAGA', 'CGCCCCCGGAT', 'TATTA')

str_view(x, "G")

## [1] │ ATTA<G>A
## [2] │ C<G>CCCCC<G><G>AT

str_view(x, "TA")

## [1] │ AT<TA>GA
## [3] │ <TA>T<TA>

the next step is, “.” where the “.” matches an entry

str_view(x, ".G.")

## [1] │ ATT<AGA>
## [2] │ <CGC>CCC<CGG>AT

Anchor allow you to match at the start or the ending

str_view(x, "^TA")

## [3] │ <TA>TTA

str_view(x, "TA$")

## [3] │ TAT<TA>

###character classes/alternatives

d matches any digit
s matches any space
[abc] matches a, b, or c

str_view(x, "TA[GT]")

## [1] │ AT<TAG>A
## [3] │ <TAT>TA

[^anc] matches anything BUT a, b, or c

str_view(x, "TA[^T]")

## [1] │ AT<TAG>A

you can also use | to pick between two alternatives

str_view(x, "TA[G|T]")

## [1] │ AT<TAG>A
## [3] │ <TAT>TA

Detect Matches

str_detect () returns a logical vector the same length of input

y <- c("apple", "bannana", "pear")
y

## [1] "apple"   "bannana" "pear"

str_detect(y, "e")

## [1]  TRUE FALSE  TRUE

How many common words start with letter e

Lets get more complex, what proportion words end in a vowel?

Now lets extract

you can also use str count to say how many matches there are in a string

lets couple this with mutate

df <- tibble(
  word = words,
  count = seq_along(word)
)

df

## # A tibble: 980 × 2
##    word     count
##    <chr>    <int>
##  1 a            1
##  2 able         2
##  3 about        3
##  4 absolute     4
##  5 accept       5
##  6 account      6
##  7 achieve      7
##  8 across       8
##  9 act          9
## 10 active      10
## # ℹ 970 more rows

df%>%
  mutate(
    vowels = str_count(words, "[aeiou]"),
    consstonants = str_count(words, "[^aeiou]")
  )

## # A tibble: 980 × 4
##    word     count vowels consstonants
##    <chr>    <int>  <int>        <int>
##  1 a            1      1            0
##  2 able         2      2            2
##  3 about        3      3            2
##  4 absolute     4      4            4
##  5 accept       5      2            4
##  6 account      6      3            4
##  7 achieve      7      4            3
##  8 across       8      2            4
##  9 act          9      1            2
## 10 active      10      3            3
## # ℹ 970 more rows

Miccroarrays

Load Libraries

library(ath1121501cdf)

## Warning: replacing previous import 'AnnotationDbi::tail' by 'utils::tail' when
## loading 'ath1121501cdf'

## Warning: replacing previous import 'AnnotationDbi::head' by 'utils::head' when
## loading 'ath1121501cdf'

##

library(ath1121501.db)

## Loading required package: AnnotationDbi

## Loading required package: stats4

## Loading required package: BiocGenerics

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:lubridate':
## 
##     intersect, setdiff, union

## The following objects are masked from 'package:dplyr':
## 
##     combine, intersect, setdiff, union

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs

## The following objects are masked from 'package:base':
## 
##     anyDuplicated, aperm, append, as.data.frame, basename, cbind,
##     colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
##     get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
##     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
##     Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
##     table, tapply, union, unique, unsplit, which.max, which.min

## Loading required package: Biobase

## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.

## Loading required package: IRanges

## Loading required package: S4Vectors

## 
## Attaching package: 'S4Vectors'

## The following objects are masked from 'package:lubridate':
## 
##     second, second<-

## The following objects are masked from 'package:dplyr':
## 
##     first, rename

## The following object is masked from 'package:tidyr':
## 
##     expand

## The following object is masked from 'package:utils':
## 
##     findMatches

## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname

## 
## Attaching package: 'IRanges'

## The following object is masked from 'package:lubridate':
## 
##     %within%

## The following objects are masked from 'package:dplyr':
## 
##     collapse, desc, slice

## The following object is masked from 'package:purrr':
## 
##     reduce

## 
## Attaching package: 'AnnotationDbi'

## The following object is masked from 'package:dplyr':
## 
##     select

## Loading required package: org.At.tair.db

##

##

library(annotate)

## Loading required package: XML

library(affy)

## 
## Attaching package: 'affy'

## The following object is masked from 'package:lubridate':
## 
##     pm

library(limma)

## 
## Attaching package: 'limma'

## The following object is masked from 'package:BiocGenerics':
## 
##     plotMA

library(oligo)

## Loading required package: oligoClasses

## Welcome to oligoClasses version 1.64.0

## 
## Attaching package: 'oligoClasses'

## The following object is masked from 'package:affy':
## 
##     list.celfiles

## Loading required package: Biostrings

## Loading required package: XVector

## 
## Attaching package: 'XVector'

## The following object is masked from 'package:purrr':
## 
##     compact

## Loading required package: GenomeInfoDb

## 
## Attaching package: 'Biostrings'

## The following object is masked from 'package:base':
## 
##     strsplit

## ================================================================================

## Welcome to oligo version 1.66.0

## ================================================================================

## 
## Attaching package: 'oligo'

## The following object is masked from 'package:limma':
## 
##     backgroundCorrect

## The following objects are masked from 'package:affy':
## 
##     intensity, MAplot, mm, mm<-, mmindex, pm, pm<-, pmindex,
##     probeNames, rma

## The following object is masked from 'package:lubridate':
## 
##     pm

## The following object is masked from 'package:dplyr':
## 
##     summarize

library(dplyr)
library(reshape)

## 
## Attaching package: 'reshape'

## The following objects are masked from 'package:S4Vectors':
## 
##     expand, rename

## The following object is masked from 'package:lubridate':
## 
##     stamp

## The following object is masked from 'package:dplyr':
## 
##     rename

## The following objects are masked from 'package:tidyr':
## 
##     expand, smiths

Read the cel files

targets <- readTargets("Bric16_Targets.csv", sep = ",", row.names = "filename")

ab <- ReadAffy()

hist(ab)

Normlized microarray experiments

eset <- affy::rma(ab)

## Background correcting
## Normalizing
## Calculating Expression

pData(eset)

##                                          sample
## Atha_Ler-0_sShoots_FLT_Rep1_F-F2_raw.CEL      1
## Atha_Ler-0_sShoots_FLT_Rep2_F-F3_raw.CEL      2
## Atha_Ler-0_sShoots_FLT_Rep3_F-F4_raw.CEL      3
## Atha_Ler-0_sShoots_GC_Rep1_H-F2_raw.CEL       4
## Atha_Ler-0_sShoots_GC_Rep2_H-F3_raw.CEL       5
## Atha_Ler-0_sShoots_GC_Rep3_H-F4_raw.CEL       6

Normalized data visulized

hist(eset)

charaterize the data

ID <- featureNames(eset)
symbol <- getSYMBOL(ID, "ath1121501.db")

affydata <- read.delim("AFFY_ATH1_array_elements.txt")

## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,
## : EOF within quoted string

###Differntial Gene Expression Analyssis

Fligth vs Ground

condition <- targets$gravity

design <- model.matrix(~factor(condition))
colnames(design) <- c("flight", "ground")

fit <- lmFit(eset, design)
fit <- eBayes(fit)
options(digits = 2)

top <- topTable(fit, coef =2, n=Inf, adjust='fdr')

Lets combine annotations

Annot <- data.frame(Gene = sapply(contents(ath1121501GENENAME), paste, collapse = ",'"), 
                   KEGG = sapply(contents(ath1121501PATH), paste, collapse = ","),
                   GO = sapply(contents(ath1121501GO), paste, collapse = ","),
                   SYMBOL = sapply(contents(ath1121501SYMBOL), paste, collapse = ","),
                   ACCNUM = sapply(contents(ath1121501ACCNUM), paste, collapse = ","))

## Warning:   The contents() method for Bimap objects is deprecated. Please use
##   as.list() instead.

## Warning:   The contents() method for Bimap objects is deprecated. Please use
##   as.list() instead.

## Warning:   The contents() method for Bimap objects is deprecated. Please use
##   as.list() instead.

## Warning:   The contents() method for Bimap objects is deprecated. Please use
##   as.list() instead.

## Warning:   The contents() method for Bimap objects is deprecated. Please use
##   as.list() instead.

Merge all data into one data frame

all <- merge(Annot, top, by.x =0, by.y =0, all =TRUE)

all2 <- merge(all, affydata, by.x = "Row.names", by.y = "array_element_name")

Convert ACCNUM to character value

all$ACCNUM <- as.character(all$ACCNUM)

write.table(all2, file="BRIC_16_Final.csv", row.names = TRUE, col.names = TRUE, sep = "\t")

columns(org.At.tair.db)

##  [1] "ARACYC"       "ARACYCENZYME" "ENTREZID"     "ENZYME"       "EVIDENCE"    
##  [6] "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"        "ONTOLOGY"    
## [11] "ONTOLOGYALL"  "PATH"         "PMID"         "REFSEQ"       "SYMBOL"      
## [16] "TAIR"

foldchanges <- as.data.frame(all$logFC)

all2$entrez = mapIds(org.At.tair.db,
                     keys = all2$ACCNUM,
                     column = "ENTREZID",
                     keytype = "TAIR",
                     multivals = "first")

## 'select()' returned 1:1 mapping between keys and columns

head(all2, 10)

##    Row.names                 Gene KEGG
## 1  244903_at hypothetical protein   NA
## 2  244904_at hypothetical protein   NA
## 3  244906_at hypothetical protein   NA
## 4  244907_at hypothetical protein   NA
## 5  244908_at hypothetical protein   NA
## 6  244911_at hypothetical protein   NA
## 7  244913_at hypothetical protein   NA
## 8  244914_at hypothetical protein   NA
## 9  244916_at hypothetical protein   NA
## 10 244917_at hypothetical protein   NA
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         GO
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"),list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"),list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"),list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"),list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"),list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"),list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"),list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"),list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"),list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"),list(GOID = "GO:0005575", Evidence = "ND", Ontology = "CC"),list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 7                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"),list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"),list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 8  list(GOID = "GO:0018130", Evidence = "IEA", Ontology = "BP"),list(GOID = "GO:0018130", Evidence = "IEA", Ontology = "BP"),list(GOID = "GO:0018130", Evidence = "IEA", Ontology = "BP"),list(GOID = "GO:0019438", Evidence = "IEA", Ontology = "BP"),list(GOID = "GO:0019438", Evidence = "IEA", Ontology = "BP"),list(GOID = "GO:0019438", Evidence = "IEA", Ontology = "BP"),list(GOID = "GO:0044271", Evidence = "IEA", Ontology = "BP"),list(GOID = "GO:0044271", Evidence = "IEA", Ontology = "BP"),list(GOID = "GO:0044271", Evidence = "IEA", Ontology = "BP"),list(GOID = "GO:1901362", Evidence = "IEA", Ontology = "BP"),list(GOID = "GO:1901362", Evidence = "IEA", Ontology = "BP"),list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"),list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 9                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"),list(GOID = "GO:0005634", Evidence = "ISM", Ontology = "CC"),list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
## 10                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    list(GOID = "GO:0008150", Evidence = "ND", Ontology = "BP"),list(GOID = "GO:0005739", Evidence = "ISM", Ontology = "CC"),list(GOID = "GO:0003674", Evidence = "ND", Ontology = "MF")
##     SYMBOL    ACCNUM  logFC AveExpr     t P.Value adj.P.Val    B
## 1   ORF149 ATMG00660 -1.214     7.8 -6.20  0.0004     0.018  0.5
## 2   ORF275 ATMG00670 -0.568     3.2 -2.07  0.0758     0.300 -4.9
## 3  ORF240A ATMG00690  0.055     9.1  0.18  0.8586     0.947 -6.8
## 4   ORF120 ATMG00710 -0.883     4.3 -2.91  0.0220     0.154 -3.7
## 5  ORF107D ATMG00720 -0.463     2.2 -1.73  0.1262     0.396 -5.4
## 6   ORF170 ATMG00820 -0.193     2.7 -0.72  0.4948     0.749 -6.5
## 7  ORF121B ATMG00840 -0.339     1.8 -1.18  0.2764     0.574 -6.1
## 8  ORF107E ATMG00850 -0.305     5.5 -1.30  0.2338     0.534 -6.0
## 9   ORF187 ATMG00880 -0.888     2.8 -2.32  0.0527     0.245 -4.6
## 10  ORF184 ATMG00870 -0.379     3.6 -1.31  0.2307     0.531 -5.9
##    array_element_type             organism is_control     locus
## 1     oligonucleotide Arabidopsis thaliana         no ATMG00660
## 2     oligonucleotide Arabidopsis thaliana         no ATMG00670
## 3     oligonucleotide Arabidopsis thaliana         no ATMG00690
## 4     oligonucleotide Arabidopsis thaliana         no ATMG00710
## 5     oligonucleotide Arabidopsis thaliana         no ATMG00720
## 6     oligonucleotide Arabidopsis thaliana         no ATMG00820
## 7     oligonucleotide Arabidopsis thaliana         no AT2G07626
## 8     oligonucleotide Arabidopsis thaliana         no ATMG00850
## 9     oligonucleotide Arabidopsis thaliana         no ATMG00880
## 10    oligonucleotide Arabidopsis thaliana         no ATMG00870
##                                                                                description
## 1                                                  hypothetical protein;(source:Araport11)
## 2                                                 transmembrane protein;(source:Araport11)
## 3                                                     FO-ATPase subunit;(source:Araport11)
## 4  Polynucleotidyl transferase, ribonuclease H-like superfamily protein;(source:Araport11)
## 5                                                  hypothetical protein;(source:Araport11)
## 6                  Reverse transcriptase (RNA-dependent DNA polymerase);(source:Araport11)
## 7                                                  hypothetical protein;(source:Araport11)
## 8                               DNA/RNA polymerases superfamily protein;(source:Araport11)
## 9                                                  hypothetical protein;(source:Araport11)
## 10                                                 hypothetical protein;(source:Araport11)
##    chromosome entrez
## 1           M   <NA>
## 2           M   <NA>
## 3           M   <NA>
## 4           M   <NA>
## 5           M   <NA>
## 6           M   <NA>
## 7           2   <NA>
## 8           M   <NA>
## 9           M   <NA>
## 10          M   <NA>

##Pathview

library(pathview)

##

## ##############################################################################
## Pathview is an open source software package distributed under GNU General
## Public License version 3 (GPLv3). Details of GPLv3 is available at
## http://www.gnu.org/licenses/gpl-3.0.html. Particullary, users are required to
## formally cite the original Pathview paper (not just mention it) in publications
## or products. For details, do citation("pathview") within R.
## 
## The pathview downloads and uses KEGG data. Non-academic uses may require a KEGG
## license agreement (details at http://www.kegg.jp/kegg/legal.html).
## ##############################################################################

library(gage)

##

library(gageData)
data("kegg.sets.hs")
data(kegg.sets.hs)

foldchanges = all2$logFC
names(foldchanges) = all2$entrez

head(foldchanges)

##   <NA>   <NA>   <NA>   <NA>   <NA>   <NA> 
## -1.214 -0.568  0.055 -0.883 -0.463 -0.193

kegg.ath = kegg.gsets("ath", id.type = "entrez")
kegg.ath.sigmet = kegg.ath$kg.sets[kegg.ath$sigmet.idx]

Lets get the results

keggres = gage(foldchanges, gsets=kegg.ath.sigmet, same.dir =TRUE)

lapply(keggres, head)

## $greater
##                                                   p.geomean stat.mean   p.val
## ath03010 Ribosome                                   1.5e-14       8.2 1.5e-14
## ath01230 Biosynthesis of amino acids                3.1e-04       3.5 3.1e-04
## ath00040 Pentose and glucuronate interconversions   2.0e-03       3.0 2.0e-03
## ath00195 Photosynthesis                             2.6e-03       3.0 2.6e-03
## ath00966 Glucosinolate biosynthesis                 7.0e-03       2.7 7.0e-03
## ath01232 Nucleotide metabolism                      9.2e-03       2.4 9.2e-03
##                                                     q.val set.size    exp1
## ath03010 Ribosome                                 1.9e-12      129 1.5e-14
## ath01230 Biosynthesis of amino acids              1.9e-02       85 3.1e-04
## ath00040 Pentose and glucuronate interconversions 8.0e-02       49 2.0e-03
## ath00195 Photosynthesis                           8.0e-02       19 2.6e-03
## ath00966 Glucosinolate biosynthesis               1.6e-01       12 7.0e-03
## ath01232 Nucleotide metabolism                    1.6e-01       42 9.2e-03
## 
## $less
##                                          p.geomean stat.mean p.val q.val
## ath04120 Ubiquitin mediated proteolysis      0.043      -1.7 0.043     1
## ath04016 MAPK signaling pathway - plant      0.043      -1.7 0.043     1
## ath00592 alpha-Linolenic acid metabolism     0.045      -1.7 0.045     1
## ath03040 Spliceosome                         0.075      -1.4 0.075     1
## ath00350 Tyrosine metabolism                 0.093      -1.4 0.093     1
## ath00906 Carotenoid biosynthesis             0.116      -1.2 0.116     1
##                                          set.size  exp1
## ath04120 Ubiquitin mediated proteolysis        63 0.043
## ath04016 MAPK signaling pathway - plant        74 0.043
## ath00592 alpha-Linolenic acid metabolism       19 0.045
## ath03040 Spliceosome                           77 0.075
## ath00350 Tyrosine metabolism                   19 0.093
## ath00906 Carotenoid biosynthesis               18 0.116
## 
## $stats
##                                                   stat.mean exp1
## ath03010 Ribosome                                       8.2  8.2
## ath01230 Biosynthesis of amino acids                    3.5  3.5
## ath00040 Pentose and glucuronate interconversions       3.0  3.0
## ath00195 Photosynthesis                                 3.0  3.0
## ath00966 Glucosinolate biosynthesis                     2.7  2.7
## ath01232 Nucleotide metabolism                          2.4  2.4

greater <- keggres$greater
lessers <- keggres$less

write.csv(greater, file = "BRIC_16_pathview_Greater_Pathways.csv")
write.csv(greater, file = "BRIC_16_pathview_Lesser_Pathways.csv")

Map to pathways (greater)

keggrespathways = data.frame(id=rownames(keggres$greater), keggres$greater) %>%
  tbl_df() %>%
  filter(row_number() <=5) %>%
  .$id %>%
  as.character()

## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## ℹ Please use `tibble::as_tibble()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

keggrespathways

## [1] "ath03010 Ribosome"                                
## [2] "ath01230 Biosynthesis of amino acids"             
## [3] "ath00040 Pentose and glucuronate interconversions"
## [4] "ath00195 Photosynthesis"                          
## [5] "ath00966 Glucosinolate biosynthesis"

keggresids_greater = substr(keggrespathways, start=1, stop=8)
keggresids_greater

## [1] "ath03010" "ath01230" "ath00040" "ath00195" "ath00966"

genedata <- as.character(all2$logFC)

Define a plotting function

plot_pathway = function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "ath", new.signature =FALSE)

#Plot multiple pathways 

tmp = sapply(keggresids_greater, function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "ath", kegg.native = FALSE))

## Info: Getting gene ID data from KEGG...

## Info: Done with data retrieval!

## Note: ath03010 not rendered, 0 or 1 connected nodes!
## Try "kegg.native=T" instead!

## Info: Getting gene ID data from KEGG...

## Info: Done with data retrieval!

## Info: Working in directory /home/student/Desktop/classroom/myfiles

## Info: Writing image file ath01230.pathview.pdf

## Info: Getting gene ID data from KEGG...

## Info: Done with data retrieval!

## Warning in .local(from, to, graph): edges replaced: '217|104', '217|216'

## Info: Working in directory /home/student/Desktop/classroom/myfiles

## Info: Writing image file ath00040.pathview.pdf

## Info: Getting gene ID data from KEGG...

## Info: Done with data retrieval!

## Note: ath00195 not rendered, 0 or 1 connected nodes!
## Try "kegg.native=T" instead!

## Info: Getting gene ID data from KEGG...

## Info: Done with data retrieval!

## Info: Working in directory /home/student/Desktop/classroom/myfiles

## Info: Writing image file ath00966.pathview.pdf

map pathways to lessers

keggrespathways = data.frame(id=rownames(keggres$less), keggres$less) %>%
  tbl_df() %>%
  filter(row_number() <=5) %>%
  .$id %>%
  as.character()

## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## ℹ Please use `tibble::as_tibble()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

keggrespathways

## [1] "ath04120 Ubiquitin mediated proteolysis" 
## [2] "ath04016 MAPK signaling pathway - plant" 
## [3] "ath00592 alpha-Linolenic acid metabolism"
## [4] "ath03040 Spliceosome"                    
## [5] "ath00350 Tyrosine metabolism"

keggresids_less = substr(keggrespathways, start=1, stop=8)
keggresids_less

## [1] "ath04120" "ath04016" "ath00592" "ath03040" "ath00350"

genedata <- as.character(all2$logFC)

Define a plotting function

plot_pathway = function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "ath", new.signature =FALSE)

Plot multiple pathways

tmp = sapply(keggresids_less, function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "ath", kegg.native = FALSE))

## Info: Getting gene ID data from KEGG...

## Info: Done with data retrieval!

## Note: ath04120 not rendered, 0 or 1 connected nodes!
## Try "kegg.native=T" instead!

## Info: Getting gene ID data from KEGG...

## Info: Done with data retrieval!

## Warning: reconcile groups sharing member nodes!

##      [,1] [,2] 
## [1,] "7"  "228"
## [2,] "8"  "228"
## [3,] "7"  "229"
## [4,] "8"  "229"

## Info: Working in directory /home/student/Desktop/classroom/myfiles

## Info: Writing image file ath04016.pathview.pdf

## Info: Getting gene ID data from KEGG...

## Info: Done with data retrieval!

## Info: Working in directory /home/student/Desktop/classroom/myfiles

## Info: Writing image file ath00592.pathview.pdf

## Info: Getting gene ID data from KEGG...

## Info: Done with data retrieval!

## Note: ath03040 not rendered, 0 or 1 connected nodes!
## Try "kegg.native=T" instead!

## Info: Getting gene ID data from KEGG...

## Info: Done with data retrieval!

## Info: Working in directory /home/student/Desktop/classroom/myfiles

## Info: Writing image file ath00350.pathview.pdf

RNA Seq

EdgeR

load Libraries

library("edgeR")
library("dplyr")
library("AnnotationDbi")
library("org.Mm.eg.db")

##

rawdata = read.csv("GLDS-102_rna_seq_Normalized_Counts.csv", row.names = "gene_ID")

group <- factor(c('1', '1', '1', '1', '1', '1', '2', '2', '2', '2', '2', '2'))

dgeGlm <- DGEList(counts = rawdata, group = group)
str(dgeGlm)

## Formal class 'DGEList' [package "edgeR"] with 1 slot
##   ..@ .Data:List of 2
##   .. ..$ : num [1:24035, 1:12] 2976.8 59.8 21.2 40.1 0 ...
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:24035] "ENSMUSG00000000001" "ENSMUSG00000000028" "ENSMUSG00000000031" "ENSMUSG00000000037" ...
##   .. .. .. ..$ : chr [1:12] "Mmus_C57_6J_KDN_FLT_Rep1_M23" "Mmus_C57_6J_KDN_FLT_Rep2_M24" "Mmus_C57_6J_KDN_FLT_Rep3_M25" "Mmus_C57_6J_KDN_FLT_Rep4_M26" ...
##   .. ..$ :'data.frame':  12 obs. of  3 variables:
##   .. .. ..$ group       : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 2 2 2 2 ...
##   .. .. ..$ lib.size    : num [1:12] 40266365 40740336 37739541 39196969 36820645 ...
##   .. .. ..$ norm.factors: num [1:12] 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ names: chr [1:2] "counts" "samples"

keep <- rowSums(cpm(dgeGlm)>2) >=4

dgeGlm <- dgeGlm[keep,]

names(dgeGlm)

## [1] "counts"  "samples"

dgeGlm[["samples"]]

##                              group lib.size norm.factors
## Mmus_C57_6J_KDN_FLT_Rep1_M23     1  4.0e+07            1
## Mmus_C57_6J_KDN_FLT_Rep2_M24     1  4.1e+07            1
## Mmus_C57_6J_KDN_FLT_Rep3_M25     1  3.8e+07            1
## Mmus_C57_6J_KDN_FLT_Rep4_M26     1  3.9e+07            1
## Mmus_C57_6J_KDN_FLT_Rep5_M27     1  3.7e+07            1
## Mmus_C57_6J_KDN_FLT_Rep6_M28     1  3.6e+07            1
## Mmus_C57_6J_KDN_GC_Rep1_M33      2  3.7e+07            1
## Mmus_C57_6J_KDN_GC_Rep2_M34      2  3.7e+07            1
## Mmus_C57_6J_KDN_GC_Rep3_M35      2  4.0e+07            1
## Mmus_C57_6J_KDN_GC_Rep4_M36      2  3.6e+07            1
## Mmus_C57_6J_KDN_GC_Rep5_M37      2  3.8e+07            1
## Mmus_C57_6J_KDN_GC_Rep6_M38      2  3.5e+07            1

design <- model.matrix(~group)
design

##    (Intercept) group2
## 1            1      0
## 2            1      0
## 3            1      0
## 4            1      0
## 5            1      0
## 6            1      0
## 7            1      1
## 8            1      1
## 9            1      1
## 10           1      1
## 11           1      1
## 12           1      1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"

dgeGlmComDisp <- estimateGLMCommonDisp(dgeGlm, design, verbose = TRUE)

## Disp = 0.026 , BCV = 0.16

dgeGlmTrendDrisp <- estimateGLMTrendedDisp(dgeGlmComDisp, design)

dgeGlmTagDisp <- estimateGLMTagwiseDisp(dgeGlmTrendDrisp, design)

plotBCV(dgeGlmTagDisp)

fit <- glmFit(dgeGlmTagDisp, design)

colnames(coef(fit))

## [1] "(Intercept)" "group2"

lrt <- glmLRT(fit, coef = 2)

ttGlm <- topTags(lrt, n = Inf)

class(ttGlm)

## [1] "TopTags"
## attr(,"package")
## [1] "edgeR"

summary(deGlm <- decideTestsDGE(lrt, p = 0.1, adjust = "fdr"))

##        group2
## Down       64
## NotSig  13390
## Up        159

tagsGlm <- rownames(dgeGlmTagDisp)[as.logical(deGlm)]

plotSmear(lrt, de.tags = tagsGlm)

hits2 <- ttGlm$table[ttGlm$table$FDR < 0.1, ]

write.csv(hits2, "Mouse_EdgeR_Results_Zero_vs_1")

columns(org.Mm.eg.db)

##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
## [11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MGI"         
## [16] "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
## [21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UNIPROT"

ttGlm2 <- as.data.frame(ttGlm$table)

ttGlm2$symbol = mapIds(org.Mm.eg.db,
                       keys=row.names(ttGlm2),
                       column = "SYMBOL",
                       keytype = "ENSEMBL",
                       multiVals = "first")

## 'select()' returned 1:many mapping between keys and columns

ttGlm2$entrez = mapIds(org.Mm.eg.db,
                       keys=row.names(ttGlm2),
                       column = "ENTREZID",
                       keytype = "ENSEMBL",
                       multiVals = "first")

## 'select()' returned 1:many mapping between keys and columns

ttGlm2$name = mapIds(org.Mm.eg.db,
                       keys=row.names(ttGlm2),
                       column = "GENENAME",
                       keytype = "ENSEMBL",
                       multiVals = "first")

## 'select()' returned 1:many mapping between keys and columns

head(ttGlm2)

##                    logFC logCPM LR  PValue     FDR symbol entrez
## ENSMUSG00000026077 -1.36    3.6 80 4.3e-19 5.9e-15  Npas2  18143
## ENSMUSG00000007872  0.89    5.5 77 1.9e-18 1.3e-14    Id3  15903
## ENSMUSG00000021775  0.95    6.2 63 2.0e-15 9.1e-12  Nr1d2 353187
## ENSMUSG00000002250 -0.83    5.3 62 2.7e-15 9.2e-12  Ppard  19015
## ENSMUSG00000059824  2.26    4.6 58 2.6e-14 7.2e-11    Dbp  13170
## ENSMUSG00000074715 -1.99    3.8 47 7.0e-12 1.6e-08  Ccl28  56838
##                                                                name
## ENSMUSG00000026077                    neuronal PAS domain protein 2
## ENSMUSG00000007872                       inhibitor of DNA binding 3
## ENSMUSG00000021775  nuclear receptor subfamily 1, group D, member 2
## ENSMUSG00000002250 peroxisome proliferator activator receptor delta
## ENSMUSG00000059824          D site albumin promoter binding protein
## ENSMUSG00000074715                    C-C motif chemokine ligand 28

library(pathview)
library(gage)
library(gageData)
data(kegg.sets.mm)
data(go.subs.mm)

foldchanges <- ttGlm2$logFC
names(foldchanges) <- ttGlm2$entrez
kegg.mm = kegg.gsets("mouse", id.type = "entrez")
kegg.mm.sigmet = kegg.mm$kg.sets[kegg.mm$sigmet.idx]

# Lets get the results
keggres = gage(foldchanges, gsets=kegg.mm.sigmet, same.dir = TRUE)

lapply(keggres, head)

## $greater
##                                                                   p.geomean
## mmu03010 Ribosome                                                   9.5e-05
## mmu04550 Signaling pathways regulating pluripotency of stem cells   2.0e-03
## mmu04330 Notch signaling pathway                                    6.1e-03
## mmu04350 TGF-beta signaling pathway                                 1.3e-02
## mmu04390 Hippo signaling pathway                                    2.0e-02
## mmu00830 Retinol metabolism                                         2.1e-02
##                                                                   stat.mean
## mmu03010 Ribosome                                                       3.8
## mmu04550 Signaling pathways regulating pluripotency of stem cells       2.9
## mmu04330 Notch signaling pathway                                        2.6
## mmu04350 TGF-beta signaling pathway                                     2.2
## mmu04390 Hippo signaling pathway                                        2.1
## mmu00830 Retinol metabolism                                             2.1
##                                                                     p.val q.val
## mmu03010 Ribosome                                                 9.5e-05 0.023
## mmu04550 Signaling pathways regulating pluripotency of stem cells 2.0e-03 0.235
## mmu04330 Notch signaling pathway                                  6.1e-03 0.488
## mmu04350 TGF-beta signaling pathway                               1.3e-02 0.783
## mmu04390 Hippo signaling pathway                                  2.0e-02 0.826
## mmu00830 Retinol metabolism                                       2.1e-02 0.826
##                                                                   set.size
## mmu03010 Ribosome                                                      127
## mmu04550 Signaling pathways regulating pluripotency of stem cells      100
## mmu04330 Notch signaling pathway                                        54
## mmu04350 TGF-beta signaling pathway                                     84
## mmu04390 Hippo signaling pathway                                       125
## mmu00830 Retinol metabolism                                             37
##                                                                      exp1
## mmu03010 Ribosome                                                 9.5e-05
## mmu04550 Signaling pathways regulating pluripotency of stem cells 2.0e-03
## mmu04330 Notch signaling pathway                                  6.1e-03
## mmu04350 TGF-beta signaling pathway                               1.3e-02
## mmu04390 Hippo signaling pathway                                  2.0e-02
## mmu00830 Retinol metabolism                                       2.1e-02
## 
## $less
##                                                  p.geomean stat.mean   p.val
## mmu04613 Neutrophil extracellular trap formation   0.00012      -3.7 0.00012
## mmu04145 Phagosome                                 0.00192      -2.9 0.00192
## mmu04110 Cell cycle                                0.00276      -2.8 0.00276
## mmu04714 Thermogenesis                             0.00472      -2.6 0.00472
## mmu04217 Necroptosis                               0.00614      -2.5 0.00614
## mmu00910 Nitrogen metabolism                       0.00867      -2.6 0.00867
##                                                  q.val set.size    exp1
## mmu04613 Neutrophil extracellular trap formation 0.029      137 0.00012
## mmu04145 Phagosome                               0.221      121 0.00192
## mmu04110 Cell cycle                              0.221      134 0.00276
## mmu04714 Thermogenesis                           0.283      208 0.00472
## mmu04217 Necroptosis                             0.295      113 0.00614
## mmu00910 Nitrogen metabolism                     0.347       13 0.00867
## 
## $stats
##                                                                   stat.mean
## mmu03010 Ribosome                                                       3.8
## mmu04550 Signaling pathways regulating pluripotency of stem cells       2.9
## mmu04330 Notch signaling pathway                                        2.6
## mmu04350 TGF-beta signaling pathway                                     2.2
## mmu04390 Hippo signaling pathway                                        2.1
## mmu00830 Retinol metabolism                                             2.1
##                                                                   exp1
## mmu03010 Ribosome                                                  3.8
## mmu04550 Signaling pathways regulating pluripotency of stem cells  2.9
## mmu04330 Notch signaling pathway                                   2.6
## mmu04350 TGF-beta signaling pathway                                2.2
## mmu04390 Hippo signaling pathway                                   2.1
## mmu00830 Retinol metabolism                                        2.1

greaters <- keggres$greater
lessers <- keggres$less

keggrespathways = data.frame(id = rownames(keggres$greater), keggres$greater) %>%
  tbl_df() %>%
  filter(row_number()<=5) %>%
  .$id %>%
  as.character()

## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## ℹ Please use `tibble::as_tibble()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

  keggrespathways

## [1] "mmu03010 Ribosome"                                                
## [2] "mmu04550 Signaling pathways regulating pluripotency of stem cells"
## [3] "mmu04330 Notch signaling pathway"                                 
## [4] "mmu04350 TGF-beta signaling pathway"                              
## [5] "mmu04390 Hippo signaling pathway"

  keggresids = substr(keggrespathways, start=1, stop = 8)
keggresids

## [1] "mmu03010" "mmu04550" "mmu04330" "mmu04350" "mmu04390"

plot_pathways = function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "mouse", new.signature = FALSE)

tmp = sapply(keggresids, function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "mouse", kegg.native = FALSE))

## 'select()' returned 1:1 mapping between keys and columns

## Note: mmu03010 not rendered, 0 or 1 connected nodes!
## Try "kegg.native=T" instead!

## 'select()' returned 1:1 mapping between keys and columns

## Info: Working in directory /home/student/Desktop/classroom/myfiles

## Info: Writing image file mmu04550.pathview.pdf

## 'select()' returned 1:1 mapping between keys and columns

## Warning: reconcile groups sharing member nodes!

##      [,1] [,2]
## [1,] "13" "74"
## [2,] "13" "75"
## [3,] "13" "76"

## Info: Working in directory /home/student/Desktop/classroom/myfiles

## Info: Writing image file mmu04330.pathview.pdf

## 'select()' returned 1:1 mapping between keys and columns

## Warning: reconcile groups sharing member nodes!

##      [,1]  [,2] 
## [1,] "213" "234"
## [2,] "214" "234"
## [3,] "215" "234"
## [4,] "216" "234"
## [5,] "213" "235"
## [6,] "214" "235"
## [7,] "215" "235"
## [8,] "216" "235"

## Warning: Nodes are not the same type, hence unable to combine!

## Info: Working in directory /home/student/Desktop/classroom/myfiles

## Info: Writing image file mmu04350.pathview.pdf

## 'select()' returned 1:1 mapping between keys and columns

## Info: Working in directory /home/student/Desktop/classroom/myfiles

## Info: Writing image file mmu04390.pathview.pdf

keggrespathways = data.frame(id = rownames(keggres$less), keggres$less) %>%
  tbl_df() %>%
  filter(row_number()<=5) %>%
  .$id %>%
  as.character()

## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## ℹ Please use `tibble::as_tibble()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

  keggrespathways

## [1] "mmu04613 Neutrophil extracellular trap formation"
## [2] "mmu04145 Phagosome"                              
## [3] "mmu04110 Cell cycle"                             
## [4] "mmu04714 Thermogenesis"                          
## [5] "mmu04217 Necroptosis"

  keggresids = substr(keggrespathways, start=1, stop = 8)
keggresids

## [1] "mmu04613" "mmu04145" "mmu04110" "mmu04714" "mmu04217"

plot_pathways = function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "mouse", new.signature = FALSE)

tmp = sapply(keggresids, function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "mouse", kegg.native = FALSE))

## 'select()' returned 1:1 mapping between keys and columns

## Info: Working in directory /home/student/Desktop/classroom/myfiles

## Info: Writing image file mmu04613.pathview.pdf

## 'select()' returned 1:1 mapping between keys and columns

## Info: Working in directory /home/student/Desktop/classroom/myfiles

## Info: Writing image file mmu04145.pathview.pdf

## 'select()' returned 1:1 mapping between keys and columns

## Warning: reconcile groups sharing member nodes!

##      [,1] [,2] 
## [1,] "9"  "300"
## [2,] "9"  "306"

## Info: Working in directory /home/student/Desktop/classroom/myfiles

## Info: Writing image file mmu04110.pathview.pdf

## 'select()' returned 1:1 mapping between keys and columns

## Info: Working in directory /home/student/Desktop/classroom/myfiles

## Info: Writing image file mmu04714.pathview.pdf

## 'select()' returned 1:1 mapping between keys and columns

## Warning: reconcile groups sharing member nodes!

##      [,1]  [,2] 
## [1,] "226" "232"
## [2,] "228" "232"
## [3,] "226" "234"
## [4,] "228" "234"

## Info: Working in directory /home/student/Desktop/classroom/myfiles

## Info: Writing image file mmu04217.pathview.pdf

library(imager)

## Loading required package: magrittr

## 
## Attaching package: 'magrittr'

## The following object is masked from 'package:purrr':
## 
##     set_names

## The following object is masked from 'package:tidyr':
## 
##     extract

## 
## Attaching package: 'imager'

## The following object is masked from 'package:magrittr':
## 
##     add

## The following object is masked from 'package:Biostrings':
## 
##     width

## The following object is masked from 'package:XVector':
## 
##     width

## The following objects are masked from 'package:oligoClasses':
## 
##     B, B<-

## The following objects are masked from 'package:IRanges':
## 
##     resize, width

## The following object is masked from 'package:S4Vectors':
## 
##     width

## The following object is masked from 'package:Biobase':
## 
##     channel

## The following object is masked from 'package:BiocGenerics':
## 
##     width

## The following object is masked from 'package:stringr':
## 
##     boundary

## The following object is masked from 'package:dplyr':
## 
##     where

## The following object is masked from 'package:tidyr':
## 
##     fill

## The following objects are masked from 'package:stats':
## 
##     convolve, spectrum

## The following object is masked from 'package:graphics':
## 
##     frame

## The following object is masked from 'package:base':
## 
##     save.image

#This line doesnt work for some reason
filenames <- list.files(path = "~/Desktop/classroom/myfiles/meganbarr789_gmail_com", pattern = ".*pathview.png")


all_iages <- lapply(filenames, load.image)




knitr::include_graphics(filenames)

DESeq

Load the required libraries

library("DESeq2")

## Loading required package: GenomicRanges

## 
## Attaching package: 'GenomicRanges'

## The following object is masked from 'package:magrittr':
## 
##     subtract

## Loading required package: SummarizedExperiment

## Loading required package: MatrixGenerics

## Loading required package: matrixStats

## 
## Attaching package: 'matrixStats'

## The following objects are masked from 'package:Biobase':
## 
##     anyMissing, rowMedians

## The following object is masked from 'package:dplyr':
## 
##     count

## 
## Attaching package: 'MatrixGenerics'

## The following objects are masked from 'package:matrixStats':
## 
##     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
##     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
##     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
##     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
##     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
##     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
##     colWeightedMeans, colWeightedMedians, colWeightedSds,
##     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
##     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
##     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
##     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
##     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
##     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
##     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
##     rowWeightedSds, rowWeightedVars

## The following object is masked from 'package:Biobase':
## 
##     rowMedians

library("dplyr")
library("apeglm")

load in data

countData <- read.csv("GLDS-102_rna_seq_Unnormalized_Counts.csv")

colData <- read.csv("PHENO_DATA_mouse.csv", sep= ",", row.names = 1)

add leveling factors to colData

colData$group <- factor(colData$group, levels = c("0", "1"))

make sure there are the same number of individuals as well as groups

all(rownames(colData))%in% colnames(countData)

## Warning in all(rownames(colData)): coercing argument of type 'character' to
## logical

## [1] FALSE

round up the counts

countData %>%
  mutate_if(is.numeric, ceiling)

countData[, 2:13] <- sapply(countData[,2:13], as.integer)

row.names(countData) <- countData[,1]

countData <- countData[2:13]

rownames(colData) == colnames(countData)

dds <- DESeqDataSetFromMatrix(countData = countData, colData = colData, design = ~group)

dds <- dds[rowSums(counts(dds)>2) >=4]

dds <- DESeq(dds)

## estimating size factors

## estimating dispersions

## gene-wise dispersion estimates

## mean-dispersion relationship

## final dispersion estimates

## fitting model and testing

res <- results(dds)

resLFC <- lfcShrink(dds, coef = 2)

## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
##     Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
##     sequence count data: removing the noise and preserving large differences.
##     Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895

(resOrdered <- res[order(res$padj), ])

## log2 fold change (MLE): group 1 vs 0 
## Wald test p-value: group 1 vs 0 
## DataFrame with 22008 rows and 6 columns
##                     baseMean log2FoldChange     lfcSE        stat      pvalue
##                    <numeric>      <numeric> <numeric>   <numeric>   <numeric>
## ENSMUSG00000002250  1459.223      -0.926292 0.1112628    -8.32526 8.41430e-17
## ENSMUSG00000007872  1719.375       0.818829 0.1122820     7.29261 3.04014e-13
## ENSMUSG00000026077   437.035      -1.191812 0.1655873    -7.19748 6.13338e-13
## ENSMUSG00000040998 14579.593      -0.506307 0.0703771    -7.19421 6.28252e-13
## ENSMUSG00000021775  2804.923       0.842511 0.1233312     6.83129 8.41546e-12
## ...                      ...            ...       ...         ...         ...
## ENSMUSG00000118345   4.22314    -0.12097478  0.599072 -0.20193699    0.839966
## ENSMUSG00000118353   6.60578     0.56456713  0.481195  1.17326031    0.240691
## ENSMUSG00000118358   3.30902    -0.00273584  0.763559 -0.00358301    0.997141
## ENSMUSG00000118369   2.91657    -1.11623145  0.790702 -1.41169702    0.158039
## ENSMUSG00000118384   7.43136     0.23830798  0.489273  0.48706567    0.626212
##                           padj
##                      <numeric>
## ENSMUSG00000002250 1.24077e-12
## ENSMUSG00000007872 2.24149e-09
## ENSMUSG00000026077 2.31605e-09
## ENSMUSG00000040998 2.31605e-09
## ENSMUSG00000021775 2.48189e-08
## ...                        ...
## ENSMUSG00000118345          NA
## ENSMUSG00000118353          NA
## ENSMUSG00000118358          NA
## ENSMUSG00000118369          NA
## ENSMUSG00000118384          NA

summary(res)

## 
## out of 22008 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 325, 1.5%
## LFC < 0 (down)     : 327, 1.5%
## outliers [1]       : 15, 0.068%
## low counts [2]     : 7247, 33%
## (mean count < 38)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

FLT_Vs_GC <- as.data.frame(res$log2FoldChange)

head(FLT_Vs_GC)

##   res$log2FoldChange
## 1             0.0421
## 2            -0.1334
## 3            -0.0185
## 4            -0.0882
## 5            -0.0079
## 6             0.1136

plotMA(resLFC, ylim = c(-2,2))

pdf(file = "MA_Plot_FLT_vs_GC.pdf")

dev.off()

## png 
##   2

save differential expression results to file.

write.csv(as.data.frame(resOrdered), file = "Mousse_DESeq.csv")

perform a custom transformation

dds <- estimateSizeFactors(dds)

se <- SummarizedExperiment(log2(counts(dds, normalize = TRUE)+1), colData = colData(dds))

pdf(file = "PCA_PLOT_FLT_vs_GC.pdf")

plotPCA(DESeqTransform(se), intgroup = "group")

## using ntop=500 top features by variance

load packages

library(AnnotationDbi)
library(org.Mm.eg.db)

columns(org.Mm.eg.db)

##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
## [11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MGI"         
## [16] "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
## [21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UNIPROT"

foldchanges <- as.data.frame(res$log2FoldChange, row.names = row.names(res))
head(foldchanges)

##                    res$log2FoldChange
## ENSMUSG00000000001             0.0421
## ENSMUSG00000000028            -0.1334
## ENSMUSG00000000031            -0.0185
## ENSMUSG00000000037            -0.0882
## ENSMUSG00000000049            -0.0079
## ENSMUSG00000000056             0.1136

res$symbol = mapIds(org.Mm.eg.db,
                    keys = row.names(res),
                    column = "SYMBOL",
                    keytype = "ENSEMBL",
                    multiVals = "first")

## 'select()' returned 1:many mapping between keys and columns

res$entrez = mapIds(org.Mm.eg.db,
                    keys = row.names(res),
                    column = "ENTREZID",
                    keytype = "ENSEMBL",
                    multiVals = "first")

## 'select()' returned 1:many mapping between keys and columns

res$name = mapIds(org.Mm.eg.db,
                    keys = row.names(res),
                    column = "GENENAME",
                    keytype = "ENSEMBL",
                    multiVals = "first")

## 'select()' returned 1:many mapping between keys and columns

head(res, 10)

## log2 fold change (MLE): group 1 vs 0 
## Wald test p-value: group 1 vs 0 
## DataFrame with 10 rows and 9 columns
##                      baseMean log2FoldChange     lfcSE       stat    pvalue
##                     <numeric>      <numeric> <numeric>  <numeric> <numeric>
## ENSMUSG00000000001 3132.35128     0.04214340 0.0436714  0.9650117  0.334539
## ENSMUSG00000000028   68.75801    -0.13342706 0.1565936 -0.8520597  0.394181
## ENSMUSG00000000031   21.05397    -0.01853142 0.2486477 -0.0745288  0.940590
## ENSMUSG00000000037   24.42314    -0.08817270 0.2982220 -0.2956613  0.767489
## ENSMUSG00000000049    3.24919    -0.00790342 0.9613572 -0.0082211  0.993441
## ENSMUSG00000000056 1424.88216     0.11355979 0.0777635  1.4603234  0.144201
## ENSMUSG00000000058 1420.78992    -0.03893850 0.0850602 -0.4577759  0.647113
## ENSMUSG00000000078 2254.53129    -0.07540275 0.0874314 -0.8624214  0.388456
## ENSMUSG00000000085  822.68179     0.04586772 0.0584699  0.7844667  0.432766
## ENSMUSG00000000088 5946.81754    -0.05461549 0.0691178 -0.7901795  0.429423
##                         padj      symbol      entrez                   name
##                    <numeric> <character> <character>            <character>
## ENSMUSG00000000001  0.739637       Gnai3       14679 G protein subunit al..
## ENSMUSG00000000028  0.777085       Cdc45       12544 cell division cycle 45
## ENSMUSG00000000031        NA         H19       14955 H19, imprinted mater..
## ENSMUSG00000000037        NA       Scml2      107815 Scm polycomb group p..
## ENSMUSG00000000049        NA        Apoh       11818       apolipoprotein H
## ENSMUSG00000000056  0.547527        Narf       67608 nuclear prelamin A r..
## ENSMUSG00000000058  0.901904        Cav2       12390             caveolin 2
## ENSMUSG00000000078  0.774294        Klf6       23849 Kruppel-like transcr..
## ENSMUSG00000000085  0.800018       Scmh1       29871 sex comb on midleg h..
## ENSMUSG00000000088  0.798420       Cox5a       12858 cytochrome c oxidase..

pathview packages

library(pathview)
library(gage)
library(gageData)
data("kegg.sets.mm")
data("go.subs.mm")

foldchanges <- res$log2FoldChange
names(foldchanges) = res$entrez

head(foldchanges)

##   14679   12544   14955  107815   11818   67608 
##  0.0421 -0.1334 -0.0185 -0.0882 -0.0079  0.1136

kegg.mm <- kegg.gsets("mouse", id.type = "entrez")
kegg.mm.sigmet <- kegg.mm$kg.sets[kegg.mm$sigmet.idx]

map the results

keggres <- gage(foldchanges, gsets = kegg.mm.sigmet, same.dir = TRUE)

lapply(keggres, head)

## $greater
##                                           p.geomean stat.mean p.val q.val
## mmu03010 Ribosome                             0.017       2.1 0.017  0.89
## mmu04022 cGMP-PKG signaling pathway           0.030       1.9 0.030  0.89
## mmu04360 Axon guidance                        0.038       1.8 0.038  0.89
## mmu04330 Notch signaling pathway              0.042       1.8 0.042  0.89
## mmu04658 Th1 and Th2 cell differentiation     0.054       1.6 0.054  0.89
## mmu02010 ABC transporters                     0.075       1.5 0.075  0.89
##                                           set.size  exp1
## mmu03010 Ribosome                              139 0.017
## mmu04022 cGMP-PKG signaling pathway            152 0.030
## mmu04360 Axon guidance                         176 0.038
## mmu04330 Notch signaling pathway                58 0.042
## mmu04658 Th1 and Th2 cell differentiation       78 0.054
## mmu02010 ABC transporters                       46 0.075
## 
## $less
##                                                   p.geomean stat.mean  p.val
## mmu04613 Neutrophil extracellular trap formation     0.0043      -2.6 0.0043
## mmu04110 Cell cycle                                  0.0062      -2.5 0.0062
## mmu04657 IL-17 signaling pathway                     0.0113      -2.3 0.0113
## mmu04145 Phagosome                                   0.0332      -1.8 0.0332
## mmu04621 NOD-like receptor signaling pathway         0.0388      -1.8 0.0388
## mmu04625 C-type lectin receptor signaling pathway    0.0441      -1.7 0.0441
##                                                   q.val set.size   exp1
## mmu04613 Neutrophil extracellular trap formation   0.75      160 0.0043
## mmu04110 Cell cycle                                0.75      151 0.0062
## mmu04657 IL-17 signaling pathway                   0.90       75 0.0113
## mmu04145 Phagosome                                 0.91      144 0.0332
## mmu04621 NOD-like receptor signaling pathway       0.91      154 0.0388
## mmu04625 C-type lectin receptor signaling pathway  0.91       99 0.0441
## 
## $stats
##                                           stat.mean exp1
## mmu03010 Ribosome                               2.1  2.1
## mmu04022 cGMP-PKG signaling pathway             1.9  1.9
## mmu04360 Axon guidance                          1.8  1.8
## mmu04330 Notch signaling pathway                1.8  1.8
## mmu04658 Th1 and Th2 cell differentiation       1.6  1.6
## mmu02010 ABC transporters                       1.5  1.5

save greater and less than pathways

greaters <- keggres$greater
lessers <- keggres$less

keggrespathways <- data.frame(id = rownames(keggres$greater), keggres$greater) %>%
  tbl_df() %>%
  filter(row_number() <= 3) %>%
  .$id %>%
  as.character()

## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## ℹ Please use `tibble::as_tibble()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

keggrespathways

## [1] "mmu03010 Ribosome"                   "mmu04022 cGMP-PKG signaling pathway"
## [3] "mmu04360 Axon guidance"

keggresids <- substr(keggrespathways, start = 1, stop = 8)
keggresids

## [1] "mmu03010" "mmu04022" "mmu04360"

PLOT!

plot_pathway = function(pid) pathviiew(gene.data = foldchange, pathway.id = pid, specied = "mouse", new.signature = FALSE)

tmp = sapply(keggresids, function(pid) pathview(gene.data = foldchanges, pathway.id = pid, speceis = "mouse"))

## Info: Downloading xml files for hsammu03010, 1/1 pathways..

## Warning in download.file(xml.url, xml.target, quiet = T): cannot open URL
## 'https://rest.kegg.jp/get/hsammu03010/kgml': HTTP status was '404 Not Found'

## Warning: Download of hsammu03010 xml file failed!
## This pathway may not exist!

## Info: Downloading png files for hsammu03010, 1/1 pathways..

## Warning: Download of hsammu03010 png file failed!
## This pathway may not exist!

## Warning: Failed to download KEGG xml/png files, hsammu03010 skipped!

## Info: Downloading xml files for hsammu04022, 1/1 pathways..

## Warning in download.file(xml.url, xml.target, quiet = T): cannot open URL
## 'https://rest.kegg.jp/get/hsammu04022/kgml': HTTP status was '404 Not Found'

## Warning: Download of hsammu04022 xml file failed!
## This pathway may not exist!

## Info: Downloading png files for hsammu04022, 1/1 pathways..

## Warning: Download of hsammu04022 png file failed!
## This pathway may not exist!

## Warning: Failed to download KEGG xml/png files, hsammu04022 skipped!

## Info: Downloading xml files for hsammu04360, 1/1 pathways..

## Warning in download.file(xml.url, xml.target, quiet = T): cannot open URL
## 'https://rest.kegg.jp/get/hsammu04360/kgml': HTTP status was '404 Not Found'

## Warning: Download of hsammu04360 xml file failed!
## This pathway may not exist!

## Info: Downloading png files for hsammu04360, 1/1 pathways..

## Warning: Download of hsammu04360 png file failed!
## This pathway may not exist!

## Warning: Failed to download KEGG xml/png files, hsammu04360 skipped!

keggrespathways <- data.frame(id = rownames(keggres$less), keggres$less) %>%
  tbl_df() %>%
  filter(row_number() <= 3) %>%
  .$id %>%
  as.character()

## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## ℹ Please use `tibble::as_tibble()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

keggrespathways

## [1] "mmu04613 Neutrophil extracellular trap formation"
## [2] "mmu04110 Cell cycle"                             
## [3] "mmu04657 IL-17 signaling pathway"

keggresids <- substr(keggrespathways, start = 1, stop = 8)
keggresids

## [1] "mmu04613" "mmu04110" "mmu04657"

PLOT!

plot_pathway = function(pid) pathviiew(gene.data = foldchange, pathway.id = pid, specied = "mouse", new.signature = FALSE)

tmp = sapply(keggresids, function(pid) pathview(gene.data = foldchanges, pathway.id = pid, speceis = "mouse"))

## Info: Downloading xml files for hsammu04613, 1/1 pathways..

## Warning in download.file(xml.url, xml.target, quiet = T): cannot open URL
## 'https://rest.kegg.jp/get/hsammu04613/kgml': HTTP status was '404 Not Found'

## Warning: Download of hsammu04613 xml file failed!
## This pathway may not exist!

## Info: Downloading png files for hsammu04613, 1/1 pathways..

## Warning: Download of hsammu04613 png file failed!
## This pathway may not exist!

## Warning: Failed to download KEGG xml/png files, hsammu04613 skipped!

## Info: Downloading xml files for hsammu04110, 1/1 pathways..

## Warning in download.file(xml.url, xml.target, quiet = T): cannot open URL
## 'https://rest.kegg.jp/get/hsammu04110/kgml': HTTP status was '404 Not Found'

## Warning: Download of hsammu04110 xml file failed!
## This pathway may not exist!

## Info: Downloading png files for hsammu04110, 1/1 pathways..

## Warning: Download of hsammu04110 png file failed!
## This pathway may not exist!

## Warning: Failed to download KEGG xml/png files, hsammu04110 skipped!

## Info: Downloading xml files for hsammu04657, 1/1 pathways..

## Warning in download.file(xml.url, xml.target, quiet = T): cannot open URL
## 'https://rest.kegg.jp/get/hsammu04657/kgml': HTTP status was '404 Not Found'

## Warning: Download of hsammu04657 xml file failed!
## This pathway may not exist!

## Info: Downloading png files for hsammu04657, 1/1 pathways..

## Warning: Download of hsammu04657 png file failed!
## This pathway may not exist!

## Warning: Failed to download KEGG xml/png files, hsammu04657 skipped!

library(imager)

filenames <- list.files(path = "~/Desktop/classroom/myfiles/meganbarr789", pattern = ".*pathview.png")

all_images <- lapply(filenames, load.image)

knitr::include_graphics(filenames)

EdgeR vs DESeq

library(tidyverse)

EdgeR <- read.csv("Mouse_EdgeR_Results_Zero_vs_1")

DESeq <- read.csv("Mouse_DESeq.csv")

DESeq2 <- DESeq %>%
  filter(padj < 0.05)

DESeq2 <- DESeq2[,c(1,3)]

EdgeR <- EdgeR[, 1:2]

colnames(DESeq2) <- c("ID", "logFC" )
colnames(EdgeR) <- c("ID", "logFC")

install.packages("GOplot")

## Installing package into '/home/student/R/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)

library(GOplot)

## Loading required package: ggdendro

## 
## Attaching package: 'ggdendro'

## The following object is masked from 'package:imager':
## 
##     label

## Loading required package: gridExtra

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:Biobase':
## 
##     combine

## The following object is masked from 'package:BiocGenerics':
## 
##     combine

## The following object is masked from 'package:dplyr':
## 
##     combine

## Loading required package: RColorBrewer

comp <- GOVenn(DESeq2, EdgeR, label = c("DESeq1", "EdgeR"), title = "comparison of DESeq and EdgeR DE Genes", plot = FALSE)

comp$plot

Other Tools

Protien Alignment

library(msa)

seq <- readAAStringSet("hglobin.fa")

alignments <- msa(seq, method = "ClustalW")

## use default substitution matrix

alignments

## CLUSTAL 2.1  
## 
## Call:
##    msa(seq, method = "ClustalW")
## 
## MsaAAMultipleAlignment with 3 rows and 142 columns
##     aln                                                    names
## [1] MVLSPADKTNVKAAWGKVGAHAGEYG...FTPAVHASLDKFLASVSTVLTSKYR HBA_HUMAN
## [2] MVLSGEDKSNIKAAWGKIGGHGAEYG...FTPAVHASLDKFLASVSTVLTSKYR HBA_MOUSE
## [3] MSLTRTERTIILSLWSKISTQADVIG...FTADAHAAWDKFLSIVSGVLTEKYR HBAZ_CAPHI
## Con MVLS??DKTNIKAAWGKIG?HA?EYG...FTPAVHASLDKFLASVSTVLTSKYR Consensus

msaPrettyPrint(alignments, output = "pdf", showNames = "left", 
               showLogo = "none", askForOverwrite = FALSE, 
               verbose = TRUE, file = "whole_aligh.pdf")

## Multiple alignment written to temporary file /tmp/RtmpBHBxcE/seqa907e627683.fasta

## File whole_aligh.tex created

## Warning in texi2dvi(texfile, quiet = !verbose, pdf = identical(output, "pdf"),
## : texi2dvi script/program not available, using emulation

## Output file whole_aligh.pdf created

msaPrettyPrint(alignments, c(10,30), output = "pdf", showNames = "left",
               file = "Zoomed_align.pdf", showLogo = "top", askForOverwrite = FALSE,
               verbose = TRUE)

## Multiple alignment written to temporary file /tmp/RtmpBHBxcE/seqa907c2695c7.fasta

## File Zoomed_align.tex created

## Warning in texi2dvi(texfile, quiet = !verbose, pdf = identical(output, "pdf"),
## : texi2dvi script/program not available, using emulation

## Output file Zoomed_align.pdf created

Lets make a tree from out alignment

library(ape)

## 
## Attaching package: 'ape'

## The following object is masked from 'package:imager':
## 
##     where

## The following object is masked from 'package:Biostrings':
## 
##     complement

## The following object is masked from 'package:dplyr':
## 
##     where

library(seqinr)

## 
## Attaching package: 'seqinr'

## The following objects are masked from 'package:ape':
## 
##     as.alignment, consensus

## The following object is masked from 'package:matrixStats':
## 
##     count

## The following object is masked from 'package:Biostrings':
## 
##     translate

## The following object is masked from 'package:limma':
## 
##     zscore

## The following object is masked from 'package:dplyr':
## 
##     count

alignment_seqinr <- msaConvert(alignments, type = "seqinr::alignment")

distances1 <- seqinr::dist.alignment(alignment_seqinr, "identity")

tree <- ape::nj(distances1)
plot(tree,main = "Phylogenetic Tree of HBA sequences")

Synteny

library(DECIPHER)

## Loading required package: RSQLite

## Loading required package: parallel

long_seq <- readDNAStringSet("plastid_genomes.fa")
long_seq

## DNAStringSet object of length 5:
##      width seq                                              names               
## [1] 130584 GGCATAAGCTATCTTCCCAAAGG...ATGATTCAAACATAAAAGTCCT NC_018523.1 Sacch...
## [2] 161592 ATGGGCGAACGACGGGAATTGAA...AAAGAAAAAAAAATAGGAGTAA NC_022431.1 Ascle...
## [3] 117672 ATGAGTACAACTCGAAAGTCCAC...TTGATTTCATCCACAAACGAAC NC_022259.1 Nanno...
## [4] 154731 TTATCCATTTGTAGATGGAACTT...CATATACACTAAGACAAAAGTC NC_022417.1 Cocos...
## [5] 156618 GGGCGAACGACGGGAATTGAACC...CCTTTTGTAGCGAATCCGTTAT NC_022459.1 Camel...

Temporary SQLite database

Seqs2DB(long_seq, "XStringSet", "long_db", names(long_seq))

## Adding 5 sequences to the database.
## 
## Added 5 new sequences to table Seqs.
## 65 total sequences in table Seqs.
## Time difference of 0.21 secs

synteny <- FindSynteny("long_db")

## ================================================================================
## 
## Time difference of 9.8 secs

View blocs with plotting

pairs(synteny)

plot(synteny)

alignment file

alignment <- AlignSynteny(synteny, "long_db")

## ================================================================================
## 
## Time difference of 88 secs

block <- unlist(alignment[[1]])

writeXStringSet(block, "genome_blocks_out.fa")

Unannotated Regions

library(locfit)

## locfit 1.5-9.8    2023-06-11

## 
## Attaching package: 'locfit'

## The following object is masked from 'package:purrr':
## 
##     none

library(Rsamtools)


get_annotated_regions_from_gff <- function(file_name) {
  gff <- rtracklayer::import.gff(file_name)
  as(gff, "GRanges")
  
}

whole_genome <- csaw::windowCounts(
  # file.path(~Desktop/classroom/myfiles)
  file.path( "windows.bam"),
  bin = TRUE,
  filter = 0,
  width = 500,
  param = csaw::readParam(
    minq = 20, # determines what is a passing read
    dedup = TRUE, # removes pcr duplicates
    pe = "both" # requires that both pairs of reads are aligned
  )
)


 colnames(whole_genome) <- c("small_data") 
 
annotated_regions <- get_annotated_regions_from_gff(("genes.gff"))

Phylogenetic Analysis

Treeio

library(ggplot2)
library(ggtree)

## ggtree v3.10.0 For help: https://yulab-smu.top/treedata-book/
## 
## If you use the ggtree package suite in published research, please cite
## the appropriate paper(s):
## 
## Guangchuang Yu, David Smith, Huachen Zhu, Yi Guan, Tommy Tsan-Yuk Lam.
## ggtree: an R package for visualization and annotation of phylogenetic
## trees with their covariates and other associated data. Methods in
## Ecology and Evolution. 2017, 8(1):28-36. doi:10.1111/2041-210X.12628
## 
## S Xu, Z Dai, P Guo, X Fu, S Liu, L Zhou, W Tang, T Feng, M Chen, L
## Zhan, T Wu, E Hu, Y Jiang, X Bo, G Yu. ggtreeExtra: Compact
## visualization of richly annotated phylogenetic data. Molecular Biology
## and Evolution. 2021, 38(9):4039-4042. doi: 10.1093/molbev/msab166
## 
## G Yu. Data Integration, Manipulation and Visualization of Phylogenetic
## Trees (1st ed.). Chapman and Hall/CRC. 2022. ISBN: 9781032233574

## 
## Attaching package: 'ggtree'

## The following object is masked from 'package:ape':
## 
##     rotate

## The following object is masked from 'package:magrittr':
## 
##     inset

## The following object is masked from 'package:reshape':
## 
##     expand

## The following object is masked from 'package:Biostrings':
## 
##     collapse

## The following object is masked from 'package:IRanges':
## 
##     collapse

## The following object is masked from 'package:S4Vectors':
## 
##     expand

## The following object is masked from 'package:tidyr':
## 
##     expand

library(treeio)

## treeio v1.26.0 For help: https://yulab-smu.top/treedata-book/
## 
## If you use the ggtree package suite in published research, please cite
## the appropriate paper(s):
## 
## LG Wang, TTY Lam, S Xu, Z Dai, L Zhou, T Feng, P Guo, CW Dunn, BR
## Jones, T Bradley, H Zhu, Y Guan, Y Jiang, G Yu. treeio: an R package
## for phylogenetic tree input and output with richly annotated and
## associated data. Molecular Biology and Evolution. 2020, 37(2):599-603.
## doi: 10.1093/molbev/msz240
## 
## Guangchuang Yu. Using ggtree to visualize data on tree-like structures.
## Current Protocols in Bioinformatics. 2020, 69:e96. doi:10.1002/cpbi.96
## 
## Guangchuang Yu.  Data Integration, Manipulation and Visualization of
## Phylogenetic Trees (1st edition). Chapman and Hall/CRC. 2022,
## doi:10.1201/9781003279242

## 
## Attaching package: 'treeio'

## The following object is masked from 'package:seqinr':
## 
##     read.fasta

## The following object is masked from 'package:Biostrings':
## 
##     mask

itol <- ape::read.tree("itol.nwk")

ggtree(itol)

ggtree(itol, layout = "circular")

ggtree(itol) + coord_flip() + scale_x_reverse()

ggtree(itol) + geom_tiplab(color = "indianred", size = 1.5)

ggtree(itol, layour = "unrooted") + geom_strip(13, 14, color = "red", barsize = 1)

## Warning in stat_tree(data = data, mapping = mapping, geom = "segment", position = position, : Ignoring unknown parameters: `layour`
## Ignoring unknown parameters: `layour`

ggtree(itol, layout = "unrooted") + geom_hilight(node = 11, type = "encircle", fill = "steelblue")

## "daylight" method was used as default layout for unrooted tree.

## Average angle change [1] 0.174910612627282

## Average angle change [2] 0.161645191380673

## Average angle change [3] 0.129304375923319

## Average angle change [4] 0.0825706767962636

## Average angle change [5] 0.100056259084497

install.packages('BAMMtools')

## Installing package into '/home/student/R/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)

library(BAMMtools)

getmrca(itol, "Photorhabdus_luminescens", "Blochmannia_floridanus")

## [1] 206

ggtree(itol, layout = "unrooted") + geom_hilight(node = 206, type = "encircle", fill = "steelblue")

## "daylight" method was used as default layout for unrooted tree.

## Average angle change [1] 0.174910612627282

## Average angle change [2] 0.161645191380673

## Average angle change [3] 0.129304375923319

## Average angle change [4] 0.0825706767962636

## Average angle change [5] 0.100056259084497

Treespace

library(ape)
library(adegraphics)

## Registered S3 methods overwritten by 'adegraphics':
##   method         from
##   biplot.dudi    ade4
##   kplot.foucart  ade4
##   kplot.mcoa     ade4
##   kplot.mfa      ade4
##   kplot.pta      ade4
##   kplot.sepan    ade4
##   kplot.statis   ade4
##   scatter.coa    ade4
##   scatter.dudi   ade4
##   scatter.nipals ade4
##   scatter.pco    ade4
##   score.acm      ade4
##   score.mix      ade4
##   score.pca      ade4
##   screeplot.dudi ade4

## 
## Attaching package: 'adegraphics'

## The following object is masked from 'package:ape':
## 
##     zoom

## The following object is masked from 'package:GenomicRanges':
## 
##     score

## The following object is masked from 'package:Biostrings':
## 
##     score

## The following object is masked from 'package:BiocGenerics':
## 
##     score

library(treespace)

## Loading required package: ade4

## 
## Attaching package: 'ade4'

## The following objects are masked from 'package:adegraphics':
## 
##     kplotsepan.coa, s.arrow, s.class, s.corcircle, s.distri, s.image,
##     s.label, s.logo, s.match, s.traject, s.value, table.value,
##     triangle.class

## The following object is masked from 'package:GenomicRanges':
## 
##     score

## The following object is masked from 'package:Biostrings':
## 
##     score

## The following object is masked from 'package:BiocGenerics':
## 
##     score

treefiles <- list.files(file.path(getwd(), "gene_trees"), full.names = TRUE)
tree_list <- lapply(treefiles, read.tree)
class(tree_list)

## [1] "list"

class(tree_list) <- "multiPhylo"

class(tree_list)

## [1] "multiPhylo"

comparisons <- treespace(tree_list, nf = 3)

table.image(comparisons$D, nclass= 25)

plotGroves(comparisons$pco, lab.show = TRUE, lab.cex = 1.5)

groves <- findGroves(comparisons, nclust = 4)
plotGroves(groves)

Subtrees

library(ape)

newick <- read.tree("mammal_tree.nwk")

l <- subtrees(newick)

plot(newick)

plot(l[[4]], sub = "Node 4")

small_tree <- extract.clade(newick, 9)

plot(small_tree)

new_tree <- bind.tree(newick, small_tree, 3)

plot(new_tree)

Phangorn

library(Biostrings)
library(msa)
library(phangorn)

seqs <- readAAStringSet("abc.fa")

aln <- msa::msa(seqs, method = c("ClustalOmega"))

## using Gonnet

aln <- as.phyDat(aln, type = "AA")

class(aln)

## [1] "phyDat"

dist_mat <- dist.ml(aln)

upgma_tree <- upgma(dist_mat)
plot(upgma_tree, mian = "UPGMA")

## Warning in plot.window(...): "mian" is not a graphical parameter

## Warning in plot.xy(xy, type, ...): "mian" is not a graphical parameter

## Warning in title(...): "mian" is not a graphical parameter

nj_tree <- NJ(dist_mat)
plot(nj_tree, main = "NJ")

fit <- pml(nj_tree, aln)

bootstraps <- bootstrap.phyDat(aln, FUN = function(x) (NJ(dist.ml(x))), bs = 100)

plotBS(nj_tree, bootstraps, p =10)

GWAS

Variant Tools

library(GenomicRanges)
library(gmapR)
library(VariantAnnotation)

## 
## Attaching package: 'VariantAnnotation'

## The following object is masked from 'package:stringr':
## 
##     fixed

## The following object is masked from 'package:base':
## 
##     tabulate

library(VariantTools)


bam_file <- file.path(getwd(), "hg17_snps.bam")

fasta_file <- file.path(getwd(), "chr17_83k.fa")

fa <- rtracklayer::FastaFile(fasta_file)

genome <- gmapR::GmapGenome(fa, create = TRUE)

## Creating directory /home/student/.local/share/gmap

qual_params <- TallyVariantsParam(
  genome = genome,
  minimum_mapq = 20)

var_params <- VariantCallingFilters(read.count = 19, p.lower = 0.01)
                                    
called_variants <- callVariants(bam_file,
                                qual_params,
                                calling.filter = var_params)
head(called_variants)

## VRanges object with 6 ranges and 17 metadata columns:
##           seqnames    ranges strand         ref              alt     totalDepth
##              <Rle> <IRanges>  <Rle> <character> <characterOrRle> <integerOrRle>
##   [1] NC_000017.10        64      *           G                T            759
##   [2] NC_000017.10        69      *           G                T            812
##   [3] NC_000017.10        70      *           G                T            818
##   [4] NC_000017.10        73      *           T                A            814
##   [5] NC_000017.10        77      *           T                A            802
##   [6] NC_000017.10        78      *           G                T            798
##             refDepth       altDepth   sampleNames softFilterMatrix | n.read.pos
##       <integerOrRle> <integerOrRle> <factorOrRle>         <matrix> |  <integer>
##   [1]            739             20          <NA>                  |         17
##   [2]            790             22          <NA>                  |         19
##   [3]            796             22          <NA>                  |         20
##   [4]            795             19          <NA>                  |         13
##   [5]            780             22          <NA>                  |         19
##   [6]            777             21          <NA>                  |         17
##       n.read.pos.ref raw.count.total count.plus count.plus.ref count.minus
##            <integer>       <integer>  <integer>      <integer>   <integer>
##   [1]             64             759         20            739           0
##   [2]             69             812         22            790           0
##   [3]             70             818         22            796           0
##   [4]             70             814         19            795           0
##   [5]             70             802         22            780           0
##   [6]             70             798         21            777           0
##       count.minus.ref count.del.plus count.del.minus read.pos.mean
##             <integer>      <integer>       <integer>     <numeric>
##   [1]               0              0               0       30.9000
##   [2]               0              0               0       40.7273
##   [3]               0              0               0       34.7727
##   [4]               0              0               0       36.1579
##   [5]               0              0               0       38.3636
##   [6]               0              0               0       39.7143
##       read.pos.mean.ref read.pos.var read.pos.var.ref     mdfne mdfne.ref
##               <numeric>    <numeric>        <numeric> <numeric> <numeric>
##   [1]           32.8755      318.558          347.804        NA        NA
##   [2]           35.4190      377.004          398.876        NA        NA
##   [3]           36.3442      497.762          402.360        NA        NA
##   [4]           36.2176      519.551          402.843        NA        NA
##   [5]           36.0064      472.327          397.070        NA        NA
##   [6]           35.9241      609.076          390.463        NA        NA
##       count.high.nm count.high.nm.ref
##           <integer>         <integer>
##   [1]            20               738
##   [2]            22               789
##   [3]            22               796
##   [4]            19               769
##   [5]            22               780
##   [6]            21               777
##   -------
##   seqinfo: 1 sequence from chr17_83k genome
##   hardFilters(4): nonRef nonNRef readCount likelihoodRatio

get_annotated_regions_from_gff <- function(file_name) {
  gff <- rtracklayer::import.gff(file_name)
  as(gff,"GRanges")
}

genes <- get_annotated_regions_from_gff("chr17.83k.gff3")

overlaps <- GenomicRanges::findOverlaps(called_variants, genes)

overlaps

## Hits object with 12684 hits and 0 metadata columns:
##           queryHits subjectHits
##           <integer>   <integer>
##       [1]     35176           1
##       [2]     35176           2
##       [3]     35176           3
##       [4]     35177           1
##       [5]     35177           2
##       ...       ...         ...
##   [12680]     40944           2
##   [12681]     40944           7
##   [12682]     40945           1
##   [12683]     40945           2
##   [12684]     40945           7
##   -------
##   queryLength: 44949 / subjectLength: 8

identified <- genes[subjectHits(overlaps)]

Prediciting ORFs

library(Biostrings)
library(systemPipeR)

## Loading required package: ShortRead

## Loading required package: BiocParallel

## Loading required package: GenomicAlignments

## 
## Attaching package: 'GenomicAlignments'

## The following object is masked from 'package:dplyr':
## 
##     last

## 
## Attaching package: 'ShortRead'

## The following object is masked from 'package:adegraphics':
## 
##     zoom

## The following objects are masked from 'package:locfit':
## 
##     left, right

## The following object is masked from 'package:ape':
## 
##     zoom

## The following object is masked from 'package:imager':
## 
##     clean

## The following object is masked from 'package:magrittr':
## 
##     functions

## The following object is masked from 'package:oligo':
## 
##     intensity

## The following objects are masked from 'package:oligoClasses':
## 
##     chromosome, position

## The following object is masked from 'package:affy':
## 
##     intensity

## The following object is masked from 'package:dplyr':
## 
##     id

## The following object is masked from 'package:purrr':
## 
##     compose

## The following object is masked from 'package:tibble':
## 
##     view

## 
## Attaching package: 'systemPipeR'

## The following object is masked from 'package:VariantAnnotation':
## 
##     reference

## The following object is masked from 'package:DESeq2':
## 
##     results

dna_object <- readDNAStringSet("arabidopsis_chloroplast.fa")

predict_orfs <- predORF(dna_object, n = 'all', type = 'gr', mode = 'ORF', strand = 'both',
                        longest_disjoint = TRUE)

predict_orfs

## GRanges object with 2501 ranges and 2 metadata columns:
##           seqnames        ranges strand | subject_id inframe2end
##              <Rle>     <IRanges>  <Rle> |  <integer>   <numeric>
##      1 chloroplast   86762-93358      + |          1           2
##   1162 chloroplast     2056-2532      - |          1           3
##      2 chloroplast   72371-73897      + |          2           2
##   1163 chloroplast   77901-78362      - |          2           1
##      3 chloroplast   54937-56397      + |          3           3
##    ...         ...           ...    ... .        ...         ...
##   2497 chloroplast 129757-129762      - |       1336           3
##   2498 chloroplast 139258-139263      - |       1337           3
##   2499 chloroplast 140026-140031      - |       1338           3
##   2500 chloroplast 143947-143952      - |       1339           3
##   2501 chloroplast 153619-153624      - |       1340           3
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

bases <- c("A", "T", "G", "C")
raw_seq_string <- strsplit(as.character(dna_object), "")

seq_length <- width(dna_object[1])

counts <- lapply(bases, function(x) {sum(grepl(x, raw_seq_string))})

probs <- unlist(lapply(counts, function(base_count){signif(base_count/seq_length,2)}))

probs

## [1] 6.5e-06 6.5e-06 6.5e-06 6.5e-06

get_longest_orf_in_random_genome <- function(x, 
                                             length = 1000,
                                             probs = c(0.25, 0.25, 0.25, 0.25),
                                             bases = c("A", "T", "G", "C")){

random_genome <- paste0(sample(bases, size = length, replace = TRUE, prob = probs), collapse = "")

random_dna_object <- DNAStringSet(random_genome)

names(random_dna_object) <-("random_dna_string")

orfs <- predORF(random_dna_object, n =1, type = 'gr', mode = 'ORF', strand = 'both', longest_disjoint = TRUE)
return(max(width(orfs)))
}

Use the function just created to predict the ORFs in 10 rando genomes

random_lengths <- unlist(lapply(1:10, get_longest_orf_in_random_genome, length = seq_length, probs = probs, bases = bases))

Pull out the longest length from 10 simulations

longest_random_orf <- max(random_lengths)

only keep the frames that are longer in chloroplast genome

keep <- width(predict_orfs) > longest_random_orf

orfs_to_keep <- predict_orfs[keep]
orfs_to_keep

## GRanges object with 9 ranges and 2 metadata columns:
##        seqnames        ranges strand | subject_id inframe2end
##           <Rle>     <IRanges>  <Rle> |  <integer>   <numeric>
##   1 chloroplast   86762-93358      + |          1           2
##   2 chloroplast   72371-73897      + |          2           2
##   3 chloroplast   54937-56397      + |          3           3
##   4 chloroplast   57147-58541      + |          4           1
##   5 chloroplast   33918-35141      + |          5           1
##   6 chloroplast   32693-33772      + |          6           2
##   7 chloroplast 109408-110436      + |          7           3
##   8 chloroplast 114461-115447      + |          8           2
##   9 chloroplast 141539-142276      + |          9           2
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

write this data to file

extracted_orfs <- BSgenome::getSeq(dna_object, orfs_to_keep)
names(extracted_orfs) <- paste0("orf_", 1:length(orfs_to_keep))
writeXStringSet(extracted_orfs, "saved_orfs.fa")

Karyoploter

library(karyoploteR)

## Loading required package: regioneR

library(GenomicRanges)

setwd("~/Desktop/classroom/myfiles/meganbarr789")

genome_df <- data.frame(
  chr = paste0("chr", 1:5),
  start = rep(1, 5),
  end = c(34964571, 22037565, 25499034, 20862711, 31270811)
)

Convert the dataframe to a granges object

genome_gr <- makeGRangesFromDataFrame(genome_df)

Create soe snp positions to map out.

snp_pos <- sample(1:1e7, 25)
snps <- data.frame(
  chr = paste0("chr", sample(1:5,25, replace = TRUE)),
  start = snp_pos,
  end = snp_pos
)

convert the dataframe to granges

snps_gr <- makeGRangesFromDataFrame(snps)

create snp labels

snp_labels <- paste0("snp_", 1:25)

set margins for the plot

plot.params <- getDefaultPlotParams(plot.type = 1)

plot.params$dataloutmargin <- 600

Plot the snps

kp <- plotKaryotype(genome= genome_gr, plot.type = 1, plot.params = plot.params)

## No predefined canonical chromosomes found for the requested genome. Applying a heuristic chromosome filtering.

## To get the unfiltered genome, please set chromosomes="all" in the plotKaryotype call

kpPlotMarkers(kp, snps_gr, labels = snp_labels)

Add numeric data to the plot plot to 100 windows on chromosome 4

numeric_data <- data.frame(
  y = rnorm(100, mean = 1, sd = 0.5),
  chr = rep("chr4", 100),
  start = seq(1,20862711, 20862711/100),
  end = seq(1,20682711, 20862711/100)
)

make the data a granges object

numeric_data_gr <- makeGRangesFromDataFrame(numeric_data)

plot.params <- getDefaultPlotParams(plot.type = 2)

plot.params$data1outmargin <- 800
plot.params$data2outmargin <- 800
plot.params$topmargin <- 800

kp <- plotKaryotype(genome = genome_gr, plot.type = 1, plot.params = plot.params)

## No predefined canonical chromosomes found for the requested genome. Applying a heuristic chromosome filtering.

## To get the unfiltered genome, please set chromosomes="all" in the plotKaryotype call

kpPlotMarkers(kp, snps_gr, labels = snp_labels)
kpLines(kp, numeric_data_gr, y = numeric_data$y, data.panel = 2)

Final Project BISC 516

Megan Barr

2024-11-11