This is a tutorial on how to use R Markdown for reproducible research.
Here we can type long passages or descriptions of our data without needing to "hash out" our comments with the # symbol. In our first example, we will be using the ToothGrowth dataset. In this experiment, guinea pigs (literally) were given different amounts of vitamin C to see the effects on the animals' tooth growth.
To run R code in a markdown file, we need to denote the section that is considered R code. We call these sections "code chunks."
Below is a code chunk:
Toothdata <- ToothGrowth
head(Toothdata)
As you can see from clicking the "play" button on the code chunk, the results are printed inline in the R Markdown file.
```{r, fig.cap="Figure 1: The tooth growth of guinea pigs when given variable amounts of vitamin C"}
fit <- lm(len ~ dose, data = Toothdata)
b <- fit$coefficients
plot(len ~ dose, data = Toothdata)
abline(fit)
```

The slope of the regression line is `r b[2]`.
## Section Headers
We can also put sections and subsections in our R Markdown file, similar to heading levels in a Word document. This is done with the "#" symbol that we previously used to denote comments in an R script.
# First level header
## Second level header
### Third level header
Make sure that you put a space after the hash symbol, otherwise it will not work!
We can also add bullet point-type marks in our R Markdown file.
- One item
- One item
- One item
- One more item
- One more item
- One more item
- One last item
It is important to note here that in R Markdown, indentation matters!!
1. First Item
2. Second Item
3. Third Item
    a. Subitem One
    b. Subitem Two
    c. Subitem Three
## Block Quotes
We can put really nice quotes into the markdown document. We do this by using the ">" symbol.
> "Genes are like the story, and DNA is the language that the story is written in."
>
>--- Sam Kean
## Hyperlinks
Hyperlinks can also be incorporated into these files. This is especially useful in HTML output, since it is viewed in a web browser and can redirect the reader to the material that you are interested in showing them. Here we will use the link to R Markdown's homepage as an example.
[Markdown](http://rmarkdown.rstudio.com/)
## Formulas
We can also put nice formatted formulas into Markdown using two dollar signs.
Hardy-Weinberg formula
$$p^2 + 2pq + q^2 = 1$$
And you can get really complex as well!
$$\Theta = \begin{pmatrix}\alpha & \beta\\
\gamma & \delta
\end{pmatrix}$$
## Code Chunks {.tabset}
### Code Chunk Options
```{r}
print("Hello World")
```

There are also options that control how knitr interprets the code chunk. The following options are available:
- eval (T or F): whether or not to evaluate the code chunk.
- echo (T or F): whether or not to show the code for the chunk; the results will still print.
- cache: if enabled, the same code chunk will not be re-evaluated the next time knitr is run. Great for code that has LONG runtimes.
- fig.width and fig.height: the (graphical device) size of the R plots in inches. The figures are first written to the knitr document, then to files that are saved separately.
- out.width and out.height: the output size of the R plots in the output document.
- fig.cap: the text for the figure caption.
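For example, a chunk that hides its code but still evaluates it, caches the result, and sets a figure size and caption might look like this (a minimal sketch; the options shown are just illustrative choices):

```{r, eval=TRUE, echo=FALSE, cache=TRUE, fig.width=6, fig.height=4, fig.cap="Tooth length by dose"}
plot(len ~ dose, data = ToothGrowth)
```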
We can also add a table of contents to our HTML Document. We do this by altering the YAML code (the weird code chunk at the VERY top of the document.) We can add this:
title: "HTML_Tutorial"
author: "Gwyneth"
date: "5/9/2022"
output:
html_document:
toc: true
toc_float: true
This will give us a very nice floating table of contents on the right hand side of the document.
You can also add TABS to your report. To do this, you need to specify each section that you want to become a tab by placing "{.tabset}" after the header line. Every subsequent lower-level header will be a new tab, as sketched below.
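For instance, a tabbed section might look like this (the header names here are just placeholders):

## Results {.tabset}

### Plots

Content for the first tab goes here.

### Tables

Content for the second tab goes here.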
You can also add themes to your HTML document that change the highlighting color and hyperlink color of your HTML output. This can be nice aesthetically. To do this, you change the theme in your YAML to one of the following:
cerulean, journal, flatly, readable, spacelab, united, cosmo, lumen, paper, sandstone, simplex, yeti, or null
You can also change the syntax highlighting color by specifying highlight:
default, tango, pygments, kate, monochrome, espresso, zenburn, haddock, or textmate
You can also use the code_folding option to allow the reader to toggle between displaying the code and hiding the code. This is done with:
code_folding: hide
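As with toc, this option goes under html_document in the YAML header; a minimal sketch:

output:
  html_document:
    code_folding: hide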
There are a ton of options and ways for you to customize your report using HTML output. This is also a great way to display a "portfolio" of your work if you are trying to market yourself to interested parties.
library(tidyverse)
??flights
my_data <- nycflights13::flights
head(my_data)
First we will just look at the data on October 14th
filter(my_data, month == 10,day ==14)
If we want to subset this into a new variable, we do the following:
oct_14_flight <- filter(my_data, month == 10, day ==14)
What if you want to do both print and save the variable?
(oct_14_flight2 <- filter(my_data, month == 10, day ==14))
If you want to filter based on different comparisons, you can use the following operators: equal to ==, not equal to !=, greater than >, less than <, greater than or equal to >=, and less than or equal to <=.
(flights_through_september <- filter(my_data, month < 10))
If we use = instead of == to mean equals, we get an error:
(oct_14_flight2 <- filter(my_data, month = 10, day = 14))
You can also use logical operators to be more selective
And (&) Or (|) Not (!)
Let's use the "or" operator to pick flights in March or April
March_April_Flights <- filter(my_data, month ==3 | month ==4)
March_4th_Flights <- filter(my_data, month ==3 & day ==4)
Non_Jan_Flights <- filter(my_data, month !=1)
Arrange allows us to arrange the dataset based on the variables we desire.
arrange(my_data, year, day, month)
We can also do this in descending fashion
descending <-arrange(my_data, desc(year), desc(day), desc(month))
Missing values are always placed at the end of the dataframe regardless of ascending or descending.
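A quick sketch with a throwaway tibble shows this (the column x here is made up purely for illustration):

```{r}
na_demo <- tibble(x = c(5, 2, NA, 9))
arrange(na_demo, x)       # NA sorts to the bottom
arrange(na_demo, desc(x)) # NA is still at the bottom
```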
We can also select specific columns that we want to look at.
calendar <- select(my_data, year, month, day)
print(calendar)
We can also look at a range of columns
calendar2 <- select(my_data, year:day)
Let's look at all columns year through carrier
calendar3 <- select(my_data, year:carrier)
We can also choose which columns not to include
Everything_else <- select(my_data, -(year:day))
In this instance we can also use the “not” operator !
Everything_else2 <- select(my_data, !(year:day))
There are also some other helper functions that can help you select the columns you're looking for: starts_with("xyz") will select columns whose names start with "xyz"; ends_with("xyz") will select columns that end with "xyz"; contains("xyz") will select columns that contain "xyz"; matches("xyz") will select columns whose names match the regular expression "xyz".
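A few examples against the flights data (the "dep" and "time" prefixes/suffixes really do appear in this dataset):

```{r}
select(my_data, starts_with("dep"))  # dep_time, dep_delay
select(my_data, ends_with("time"))   # dep_time, arr_time, air_time, ...
select(my_data, contains("arr"))     # arrival columns (note: also matches "carrier")
```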
Here, we see that we can rename the data:
head(my_data)
rename(my_data, departure_time = dep_time)
my_data2 <- rename(my_data, departure_time = dep_time)
What if you want to add new columns to your data frame? We have the mutate() function for that.
First, let's make a smaller data frame so we can see what we're doing
my_data_small <- select(my_data, year:day, distance, air_time)
Let’s calculate the speed of the flights
mutate(my_data_small, speed = distance / air_time *60)
my_data_small <-mutate(my_data_small, speed = distance / air_time *60)
What if we want to create a new dataframe with ONLY our calculations? That is what transmute() is for:
airspeed <- transmute(my_data_small, speed = distance/air_time* 60, speed2=distance / air_time)
We can use summarize to run a function on a data column and get back a single summary value
summarize(my_data, delay = mean(dep_delay, na.rm =TRUE))
We gain additional value from summarize by pairing it with group_by()
by_day <- group_by(my_data, year, month, day)
summarize(by_day, delay = mean(dep_delay, na.rm=TRUE))
As you can see, we now have the delay by the days of the year
What happens if we don’t tell R what to do with the missing data?
summarize(by_day, delay = mean(dep_delay))
We can filter our data based on NA (which in this dataset was cancelled flights)
not_cancelled <- filter(my_data, !is.na(dep_delay), !is.na(arr_delay))
Let’s run summarize again on this data
summarize(not_cancelled, delay = mean(dep_delay))
We can count the number of values that are NA
sum(is.na(my_data$dep_delay))
We can also count the number that are NOT NA
sum(!is.na(my_data$dep_delay))
With tibble datasets, we can pipe results, which gets rid of the need for the dollar sign. We can then summarize the mean departure time by day:
my_data %>%
group_by(year, month, day) %>%
summarize(mean = mean(dep_time, na.rm =TRUE))
Now we will take the time to explore tibbles!
Tibbles are modified data frames that tweak some of the older behaviors of data frames. R is an old language, and things that were useful 20 years ago are not always as useful anymore.
as_tibble(iris)
As we can see, we have the same dataframe, but we have different features.
You can also create a tibble from scratch with tibble()
tibble(
x=1:5,
y=1,
z=x^2+y
)
You can also use tribble() for basic data table creation
tribble(
~genea, ~geneb, ~genec,
#######################
110, 112, 114,
6, 5, 4,
)
Tibbles are built to not overwhelm your console when printing data, only showing the first few lines
This is how a dataframe prints:
print(by_day)
as.data.frame(by_day)
head(by_day)
nycflights13::flights %>%
print(n=10, width=Inf)
Subsetting tibbles is easy, and is very similar to data frames
df_tibble <-tibble(nycflights13::flights)
df_tibble
We can subset by column name using the $:

```{r}
df_tibble$carrier
```

We can also subset by position using [[ ]]:

```{r}
df_tibble[[2]]
```
If you want to use this in a pipe, you need to use the “.” placeholder
df_tibble %>%
.$carrier
Some older functions do not like tibbles. Thus, you might have to convert them back to dataframes
class(df_tibble)
df_tibble_2 <- as.data.frame(df_tibble)
class(df_tibble_2)
df_tibble
head(df_tibble_2)
library(tidyverse)
How do we make a tidy dataset? Well, the tidyverse follows three rules:

1. Each variable must have its own column.
2. Each observation has its own row.
3. Each value has its own cell.
It is impossible to satisfy only two of the three rules. This leads to the following instructions for tidy data:

1. Put each dataset into a tibble.
2. Put each variable into a column.
3. Profit!
Picking one consistent method of data storage makes for easier understanding of your code and what is happening "under the hood," or behind the scenes.
Let’s now look at working with tibbles!
bmi <- tibble(women)
bmi %>%
mutate(bmi = (703 * weight)/(height)^2)
Sometimes you’ll find datasets that don’t fit well into a tibble. We’ll use the built-in data from tidyverse for this part.
table4a
As you can see from this data, we have one variable in column A (country), but columns B and C are two instances of the same variable (the year). Thus, there are two observations in each row.
To fix this, we can use the gather function
table4a %>%
gather("1999", "2000", key="year", value="cases")
Let’s look at another example:
table4b
As you can see, we have the same problem in table4b:
table4b %>%
gather("1999", "2000", key="year", value="population")
Now what if we want to join these two tables? We can use dplyr!
table4a <-table4a %>%
gather("1999", "2000", key="year", value="cases")
table4b <-table4b %>%
gather("1999", "2000", key="year", value="population")
left_join(table4a, table4b)
Spreading is the opposite of gathering! Let’s look at table2
print(table2)
You can see that we have redundant info in columns 1 and 2! We can fix that by combining rows 1 and 2, 3 and 4, etc.
spread(table2, key=type, value=count)
type is the key: what we are turning into columns; the value is what becomes the observations. In summary, spread makes long tables shorter and wider; gather makes wide tables narrower and longer.
Now what happens when we have two values stuck in one column?
print(table3)
As you can see, the rate column is just cases and population combined. We can use separate to fix this:
table3 %>%
separate(rate, into=c("cases", "population"))
However, if you notice, the column type is not correct
table3 %>%
separate(rate, into=c("cases", "population"), convert=TRUE)
You can specify what you want to separate based on.
table3 %>%
separate(rate, into =c("cases", "population"), sep= "/", convert=TRUE)
You can also separate based on character position; here we split year into century and year:
table3 %>%
separate(
year,
into=c("century", "year"),
convert=TRUE,
sep= 2
)
What happens if we want to do the inverse of separate? We use unite.
table5
table5 %>%
unite(date, century, year)
table5 %>%
unite(date, century, year, sep= "")
There can be two types of missing values: NA (explicit) or simply no entry at all (implicit)
gene_data <- tibble(
gene= c("a", "a", "a", "a", "b", "b", "b"),
nuc =c(20, 22, 24, 25, NA, 42, 67),
run =c(1,2,3,4,2,3,4)
)
gene_data
The nucleotide count for gene b, run 2 is explicitly missing. The nucleotide count for gene b, run 1 is implicitly missing.
One way we can make implicit missing values explicit is by putting genes in columns
gene_data %>%
spread(gene, nuc)
If we want to remove the missing values, we can use spread and gather, and na.rm = TRUE
gene_data %>%
spread(gene, nuc) %>%
gather(gene, nuc, "a":"b", na.rm = TRUE)
Another way that we can make implicit values explicit is complete()
gene_data %>%
complete(gene, run)
Sometimes an NA is present to represent a value being carried forward
treatment <- tribble(
~person, ~treatment, ~response,
#################################################
"Issac", 1, 7,
NA, 2, 10,
NA, 3, 9,
"VBD", 1, 8,
NA, 2, 11,
NA, 3, 10,
)
treatment
What we can do here is use the fill() option
treatment%>%
fill(person)
It is rare that you will be working with a single data table. The dplyr package allows you to join two data tables based on common values!
There are three families of join operations (see the sketch after this list):

- Mutating joins add new variables to one data frame from the matching observations in another.
- Filtering joins filter observations from one data frame based on whether or not they are present in another.
- Set operations treat observations as if they were set elements.
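As a sketch of a filtering join (semi_join() keeps rows of the first table that have a match in the second; anti_join() keeps the rows that do not):

```{r}
library(dplyr)
library(nycflights13)
# Flights whose carrier appears in the airlines table
semi_join(flights, airlines, by = "carrier")
# Flights whose carrier is missing from the airlines table
anti_join(flights, airlines, by = "carrier")
```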
library(tidyverse)
library(nycflights13)
Let’s pull full carrier names based on letter codes
airlines
Let’s get info about airports
airports
Let’s get info about each plane
planes
Let’s get some info on the weather at the airports
weather
Let’s get info on singular flights
flights
Let's look at how these tables connect:

- flights connects to planes via tailnum
- flights connects to airlines through carrier
- flights connects to airports via origin and dest
- flights connects to weather via origin and year/month/day/hour
Keys are unique identifiers for observations. A primary key uniquely identifies an observation in its own table.
One way to identify a primary key is as follows:
planes %>%
count(tailnum) %>%
filter(n>1)
No rows come back, which indicates that each tail number is unique!
planes %>%
count(model) %>%
filter(n>1)
flights2 <- flights %>%
select(year:day, hour, origin, dest, tailnum, carrier)
flights2
flights2 %>%
select(-origin, -dest) %>%
left_join(airlines, by = "carrier")
We’ve now added the airline name to our data frame from the airline data frame!
Inner joins (inner_join) match a pair of observations when their keys are equal. Outer joins (left_join, right_join, full_join) keep observations that appear in at least one table.
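A minimal sketch of the difference, using two toy tibbles invented purely for illustration:

```{r}
x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y4"))
inner_join(x, y, by = "key") # keys 1 and 2 only
full_join(x, y, by = "key")  # keys 1, 2, 3, and 4, with NAs where there is no match
```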
library(tidyverse)
library(stringr)
You can create strings using single or double quotes!
string1 <- "this is a string"
string2 <- 'To put a "quote" in your string, use the opposite'
string1
string2
If you forget to close your string, the console will wait with a + continuation prompt; just hit Escape and try again. Here the string is properly closed:
string3 <- "Where is this going?"
string3
Multiple strings are stored in character vectors!
string4 <- c("one", "two", "three")
string4
Measuring string length:
str_length(string3)
str_length(string4)
Let’s combine two strings!
str_c("x", "y")
str_c(string1, string2)
You can use sep to control how they are separated!
str_c(string1, string2, sep = " ")
str_c("x", "y", "z", sep = "_")
You can subset a string using str_sub()
HSP <- c("HSP123", "HSP234", "HSP456")
str_sub(HSP, 4,6)
This extracts characters 4 through 6, dropping the first three letters of each string!
Or you can use negatives to count back from the end
str_sub(HSP, -3, -1)
You can convert the cases of strings like follows:
HSP
str_to_lower(HSP)
str_to_upper(HSP)
install.packages("htmlwidgets")
x <- c('ATTAGA', 'CGCCCCCCGGAT', 'TATTA')
str_view(x, "G")
str_view(x, "TA")
The next metacharacter is ".", which matches any single character
str_view(x, ".G.")
Anchors allow you to match at the start or the end of a string
str_view(x, "^TA")
str_view(x, "TA$")
Character classes give more alternatives: \d matches any digit, \s matches any whitespace, and [abc] matches a, b, or c
str_view(x, "TA[GT]")
[^abc] matches anything BUT a, b, or c
str_view(x, "TA[^T]")
You can also use | to pick between two alternatives
str_view(x, "TA[G|T]")
str_detect() returns a logical vector the same length as its input
y <-c("apple", "banana", "pear")
y
str_detect(y, "e")
How many common words contain the letter e?
words
sum(str_detect(words, "e"))
Let's get more complex: what proportion of words ends in a vowel? What proportion starts with one?
mean(str_detect(words, "[aeiou]$"))
mean(str_detect(words, "^[aeiou]"))
Let’s find all the words that don’t contain “o” or “u”
no_o <- !str_detect(words, "[ou]")
no_o
Now let’s extract!
words[!str_detect(words, "[ou]")]
You can also use str_count() to count how many matches there are in a string
x
str_count(x, "[GC]")
Let’s couple this with mutate:
df <- tibble(
word = words,
count = seq_along(word)
)
df
df %>%
mutate(
vowels = str_count(words, "[aeiou]"),
consonants = str_count(words, "[^aeiou]")
)
First, we need to load the necessary packages!
setwd("C:~/Desktop/classroom/myfiles/Bric16_Targets.csv")
#setwd("~/Desktop/classroom/myfiles")
library(ath1121501cdf)
library(ath1121501.db)
library(annotate)
library(affy)
library(limma)
library(oligo)
library(dplyr)
library(stats)
library(reshape)
Read the CEL files from the directory
targets <- readTargets("Bric16_Targets.csv", sep= ",", row.names="filename")
ab <-ReadAffy()
hist(ab)
Normalizing the microarray experiments:
eset <-affy::rma(ab)
pData(eset)
Let’s visualize the normalized data!
hist(eset)
Let’s characterize the data a bit!
ID <-featureNames(eset)
Symbol <-getSYMBOL(ID, "ath1121501.db")
affydata <-read.delim("affy_ATH1_array_elements.txt")
Flight vs. Ground
condition <- targets$gravity
design <-model.matrix(~factor(condition))
colnames(design) <-c("flight", "ground")
fit <-lmFit(eset, design)
fit <-eBayes(fit)
options(digits=2)
top <- topTable(fit, coef=2, n=Inf, adjust='fdr')
Let’s combine annotations!
Annot <- data.frame(GENE = sapply(contents(ath1121501GENENAME), paste, collapse = ", "),
KEGG = sapply(contents(ath1121501PATH), paste, collapse = ", "),
GO = sapply(contents(ath1121501GO), paste, collapse = ", "),
SYMBOL = sapply(contents(ath1121501SYMBOL), paste, collapse = ", "),
ACCNUM = sapply(contents(ath1121501ACCNUM), paste, collapse = ", "))
Let’s merge all the data into one dataframe
all <-merge(Annot, top, by.x=0, by.y=0, all=TRUE)
all2 <-merge(all, affydata, by.x ="Row.names", by.y ="array_element_name")
Let’s convert the ACCNUM to a character value
all2$ACCNUM <-as.character(all2$ACCNUM)
write.table(all2, file="BRIC_16_Final.csv", row.names=TRUE, sep="\t")
library(org.At.tair.db) # needed for the TAIR annotation mapping below
columns(org.At.tair.db)
foldchanges <-as.data.frame(all$logFC)
all2$entrez=mapIds(org.At.tair.db,
keys=all2$ACCNUM,
column="ENTREZID",
keytype ="TAIR",
multiVals = "first")
head(all2, 10)
library(pathview)
library(gage)
library(gageData)
data(kegg.sets.hs)
foldchanges=all2$logFC
names(foldchanges) = all2$entrez
head(foldchanges)
kegg.ath=kegg.gsets("ath", id.type="entrez")
kegg.ath.sigmet = kegg.ath$kg.sets[kegg.ath$sigmet.idx]
Let’s get the results!
keggres = gage(foldchanges, gsets=kegg.ath.sigmet, same.dir=TRUE)
lapply(keggres, head)
Let’s find the greaters and lessers!
greater <- keggres$greater
lessers <- keggres$less
write.csv(greater, file="BRIC_16_pathview_Greater_Pathways.csv")
write.csv(lessers, file="BRIC_16_pathview_Lesser_Pathways.csv")
keggrespathways = data.frame(id=rownames(keggres$greater), keggres$greater) %>%
tbl_df() %>%
filter(row_number() <=5) %>%
.$id %>%
as.character()
keggrespathways
keggresids_greater=substr(keggrespathways, start=1, stop=8)
keggresids_greater
genedata <-as.character(all$logFC)
Define a plotting function to apply next!
plot_pathway = function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species="ath", new.signature=FALSE)
Plot multiple pathways! (The plots are saved to disk; the call returns a throwaway list object.)
tmp=sapply(keggresids_greater, function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species="ath"))
keggrespathways = data.frame(id=rownames(keggres$less), keggres$less) %>%
tbl_df() %>%
filter(row_number() <=5) %>%
.$id %>%
as.character()
keggrespathways
keggresids_lessers = substr(keggrespathways, start=1, stop=8)
keggresids_lessers
genedata <-as.character(all$logFC)
Define a plotting function to apply next!
plot_pathway = function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species="ath", new.signature=FALSE)
Plot multiple pathways (the plots are saved to disk; the call returns a throwaway list object)
tmp=sapply(keggresids_lessers, function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species="ath"))
Let’s load the necessary packages for EdgeR!
library("edgeR")
library("dplyr")
library("AnnotationDbi")
library("org.Mm.eg.db")
Let's analyze our raw data!
rawdata=read.csv("~/Desktop/classroom/myfiles/GLDS-102_rna_seq_Normalized_Counts.csv", row.names="gene_id")
group <-factor(c('1', '1', '1', '1', '1', '1', '2', '2', '2', '2', '2', '2'))
dgeGlm <-DGEList(counts=rawdata, group=group)
str(dgeGlm)
keep <- rowSums(cpm(dgeGlm)>2) >=4
dgeGlm <- dgeGlm[keep,]
names(dgeGlm)
dgeGlm[["samples"]]
design <-model.matrix(~group)
design
dgeGlmComDisp <- estimateGLMCommonDisp(dgeGlm, design, verbose=TRUE)
dgeGlmTrendDisp <- estimateGLMTrendedDisp(dgeGlmComDisp, design)
dgeGlmTagDisp <-estimateGLMTagwiseDisp(dgeGlmTrendDisp, design)
plotBCV(dgeGlmTagDisp)
fit <- glmFit(dgeGlmTagDisp, design)
colnames(coef(fit))
lrt <-glmLRT(fit, coef=2)
ttGlm <-topTags(lrt, n=Inf)
class(ttGlm)
Let’s summarize the data!
summary(deGlm <- decideTestsDGE(lrt, p=0.1, adjust="fdr"))
tagsGlm <- rownames(dgeGlmTagDisp)[as.logical(deGlm)]
plotSmear(lrt, de.tags=tagsGlm)
hits2 <- ttGlm$table[ttGlm$table$FDR < 0.1,]
write.csv(hits2, "Mouse_EdgeR_Results_Zero_vs_1.csv")
Let’s organize our data!
columns(org.Mm.eg.db)
ttGlm2 <- as.data.frame(ttGlm$table)
ttGlm2$symbol = mapIds(org.Mm.eg.db,
keys=row.names(ttGlm2),
column="SYMBOL",
keytype = "ENSEMBL",
multiVals = "first")
ttGlm2$entrez = mapIds(org.Mm.eg.db,
keys=row.names(ttGlm2),
column="ENTREZID",
keytype = "ENSEMBL",
multiVals = "first")
ttGlm2$name = mapIds(org.Mm.eg.db,
keys=row.names(ttGlm2),
column="GENENAME",
keytype = "ENSEMBL",
multiVals = "first")
head(ttGlm2)
Let’s load the packages for our pathways!
library(pathview)
library(gage)
library(gageData)
data(kegg.sets.mm)
data(go.subs.mm)
foldchanges <- ttGlm2$logFC
names(foldchanges) <-ttGlm2$entrez
kegg.mm = kegg.gsets("mouse", id.type = "entrez")
kegg.mm.sigmet <-kegg.mm$kg.sets[kegg.mm$sigmet.idx]
Let’s get the results!
keggres=gage(foldchanges, gsets = kegg.mm.sigmet, same.dir =TRUE)
lapply(keggres, head)
greaters <- keggres$greater
lessers <- keggres$less
Let’s explore our pathways (greater)!
keggrespathways = data.frame(id=rownames(keggres$greater), keggres$greater) %>%
tbl_df() %>%
filter(row_number() <=5) %>%
.$id %>%
as.character()
keggrespathways
keggresids=substr(keggrespathways, start=1, stop=8)
keggresids
Let’s plot our pathway!
plot_pathway=function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species = "mouse", new.signature=FALSE)
# Plot multiple pathways
tmp=sapply(keggresids, function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species="mouse"))
Let’s explore the pathways (lesser)!
keggrespathways = data.frame(id=rownames(keggres$less), keggres$less) %>%
tbl_df() %>%
filter(row_number() <=5) %>%
.$id %>%
as.character()
keggrespathways
keggresids=substr(keggrespathways, start=1, stop=8)
keggresids
Let’s plot our pathways (lesser)!
plot_pathway=function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species = "mouse", new.signature=FALSE)
Plot multiple pathways!
tmp=sapply(keggresids, function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species="mouse"))
Let's gather the saved plot files!
library(imager)
filenames <- list.files(path = "E:/Bioinformatics/BISC_450_Scripts/mouse_edgeR", pattern=".*pathview.png")
all_images <-lapply(filenames, load.image)
Let's use the knitr function to include graphics!
knitr::include_graphics(filenames)
Let's load the required libraries for this analysis
BiocManager::install("apeglm", update=FALSE)
library("DESeq2")
library("dplyr")
library("apeglm")
Let's load in our data
countData <-read.csv("~/Desktop/classroom/myfiles/GLDS-102_rna_seq_Unnormalized_Counts.csv")
colData <- read.csv("~/Desktop/classroom/myfiles/PHENO_DATA_Mouse.csv", sep=",", row.names=1)
Now we need to add factor levels to the colData
colData$group <-factor(colData$group, levels = c("0", "1"))
Now let's make sure the sample names in colData match the columns of countData
all(rownames(colData) %in% colnames(countData))
We need to round up the counts, because DESeq2 only allows integers as input, not fractional numbers. Whether this is needed depends on the method that was used upstream.

```{r, results='hide'}
countData %>% mutate_if(is.numeric, ceiling)
```
countData[, 2:13] <- sapply(countData[, 2:13], as.integer)
row.names(countData) <-countData[,1]
countData <- countData[2:13]
rownames(colData) == colnames(countData)
# Build the DESeq2 dataset from the counts and sample data, then run the DESeq pipeline
dds <- DESeqDataSetFromMatrix(countData = countData, colData = colData, design = ~group)
dds <- DESeq(dds)
res <- results(dds)
Let's summarize our data!
```{r}
resLFC <- lfcShrink(dds, coef=2)
(resOrdered <- res[order(res$padj), ])
summary(res)
```
Let’s load our data into a dataframe!
FLT_Vs_GC <- as.data.frame(res$log2FoldChange)
head(FLT_Vs_GC)
Let’s plot our data!
plotMA(resLFC, ylim=c(-2,2))
pdf(file ="MA_Plot_FLT_vs_GC.pdf")
plotMA(resLFC, ylim=c(-2,2))
dev.off()
Let's save our differential expression results to file.
write.csv(as.data.frame(resOrdered), file="Mouse_DESeq.csv")
Let's perform a custom transformation
dds <- estimateSizeFactors(dds)
se <- SummarizedExperiment(log2(counts(dds, normalized=TRUE) +1), colData=colData(dds))
pdf(file="PCA_PLOT_FLT_vs_GC.pdf")
plotPCA(DESeqTransform(se), intgroup="group")
Let's load our required packages
library(AnnotationDbi)
library(org.Mm.eg.db)
Let's add annotations to our results!
columns(org.Mm.eg.db)
foldchanges <-as.data.frame(res$log2FoldChange, row.names=row.names(res))
head(foldchanges)
res$symbol = mapIds(org.Mm.eg.db,
keys=row.names(res),
column="SYMBOL",
keytype="ENSEMBL",
multiVals = "first")
res$entrez = mapIds(org.Mm.eg.db,
keys=row.names(res),
column="ENTREZID",
keytype="ENSEMBL",
multiVals = "first")
res$name = mapIds(org.Mm.eg.db,
keys=row.names(res),
column="GENENAME",
keytype="ENSEMBL",
multiVals = "first")
head(res, 10)
Let's load our pathview packages!
library(pathview)
library(gage)
library(gageData)
data(kegg.sets.mm)
data(go.subs.mm)
foldchanges <- res$log2FoldChange
names(foldchanges) = res$entrez
head(foldchanges)
kegg.mm = kegg.gsets("mouse", id.type= "entrez")
kegg.mm.sigmet <-kegg.mm$kg.sets[kegg.mm$sigmet.idx]
Let's map the results!
keggres <- gage(foldchanges, gsets = kegg.mm.sigmet, same.dir = TRUE)
lapply(keggres, head)
Let's save our greater and lesser pathways
greaters <-keggres$greater
lessers <- keggres$less
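If we want these on disk like in the earlier BRIC-16 analysis, write.csv() works the same way (the file names here are just suggestions):

```{r}
write.csv(greaters, file = "Mouse_DESeq_Greater_Pathways.csv")
write.csv(lessers, file = "Mouse_DESeq_Lesser_Pathways.csv")
```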
Let’s analyze our greater pathways!
keggrespathways <- data.frame(id = rownames(keggres$greater), keggres$greater) %>%
tbl_df() %>%
filter(row_number() <=3) %>%
.$id %>%
as.character()
keggrespathways
keggresids <-substr(keggrespathways, start=1, stop=8)
keggresids
Let's plot our pathways!
plot_pathway=function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species= "mouse", new.signature=FALSE)
tmp = sapply(keggresids, function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species="mouse"))
Let’s format our lesser pathways!
keggrespathways <- data.frame(id = rownames(keggres$less), keggres$less) %>%
tbl_df() %>%
filter(row_number() <=3) %>%
.$id %>%
as.character()
keggrespathways
keggresids <-substr(keggrespathways, start=1, stop=8)
keggresids
Let’s plot our pathways!
plot_pathway=function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species= "mouse", new.signature=FALSE)
tmp = sapply(keggresids, function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species="mouse"))
Let's pull our pathway images into a document!
library(imager)
filenames <-list.files(path= "E:/Bioinformatics/Bisc_450_Scripts/mouse_Deseq", pattern=".*pathview.png")
all_images <- lapply(filenames, load.image)
Let’s use the knitr function to include graphics in our new file!
knitr::include_graphics(filenames)
Let’s load the necessary library for this!
library(tidyverse)
Let’s view the documents we created earlier!
EdgeR <- read.csv("~/Desktop/classroom/myfiles/ged015@latech.edu/Mouse_EdgeR_Results_Zero_vs_1.csv")
DESeq <- read.csv("~/Desktop/classroom/myfiles/ged015@latech.edu/Mouse_DESeq.csv")
DESeq2 <- DESeq %>%
filter(padj < 0.1)
DESeq2 <- DESeq2[,c(1,3)]
EdgeR <- EdgeR[, 1:2]
colnames(DESeq2) <-c("ID", "logFC")
colnames(EdgeR) <-c("ID", "logFC")
Install the "GOplot" package
install.packages("GOplot")
library(GOplot)
Plot and create a table using the previous information!
comp <-GOVenn(DESeq2, EdgeR, label =c("DESeq", "EdgeR"), title="Comparison of DESeq and EdgeR DE Genes", plot=FALSE)
comp$plot
comp$table
Let’s load the library needed for this!
library(msa)
seq <-readAAStringSet("hglobin.fa")
seq
Let’s align the 8 different amino acid sequences
alignments <-msa(seq, method="ClustalW")
alignments
msaPrettyPrint(alignments, output="pdf",
showNames="left", showLogo="none", askForOverwrite = FALSE,
verbose=TRUE, file="whole_align.pdf")
msaPrettyPrint(alignments, c(10, 30), output="pdf", showNames="left",
file = "Zoomed_align.pdf", showLogo = "top", askForOverwrite = FALSE,
verbose=TRUE)
Let’s make a tree from our alignment
library(ape)
library(seqinr)
Convert the seqinr alignment, get the distances, and make a tree
alignment_seqinr <- msaConvert(alignments, type = "seqinr::alignment")
distances1 <- seqinr::dist.alignment(alignment_seqinr, "identity")
tree <- ape::nj(distances1)
plot(tree, main="Phylogenetic Tree of MBA Sequences")
Let’s load the necessary library!
library(DECIPHER)
In the first step, we load the sequences into long_seq. This is a DNAStringSet object.
long_seq <- readDNAStringSet("plastid_genomes.fa")
long_seq
Now let’s build a temporary SQLite database
Seqs2DB(long_seq, "XStringSet", "long_db", names(long_seq))
Now that we have built the database, we can find the syntenic blocks
synteny <- FindSynteny("long_db")
View the blocks with plotting!
pairs(synteny)
plot(synteny)
Make an actual alignment file
alignment <- AlignSynteny(synteny, "long_db")
Let's create a structure holding all aligned syntenic blocks for a pair of sequences
block <- unlist(alignment[[1]])
We can write to file one alignment at a time
writeXStringSet(block, "genome_blocks_out.fa")
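If we wanted every pairwise alignment on disk, a small loop over the alignment list would do it (a sketch; the file naming scheme is arbitrary):

```{r}
for (i in seq_along(alignment)) {
  blocks <- unlist(alignment[[i]]) # all syntenic blocks for this pair of sequences
  writeXStringSet(blocks, paste0("genome_blocks_pair_", i, ".fa"))
}
```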
Let's load the necessary libraries!
library(locfit)
library(Rsamtools)
Let’s create a function that will load the gene region information in a GFF file and convert it to a bioconductor GRanges object
get_annotated_regions_from_gff <- function(file_name) {
gff <- rtracklayer::import.gff(file_name)
as(gff, "GRanges")
}
Get counts in 500 bp windows across the genome
whole_genome <- csaw::windowCounts(
# file.path(~Desktop/classroom/myfiles)
file.path(getwd(), "windows.bam"),
bin=TRUE,
filter=0,
width=500,
param=csaw::readParam(
minq=20, # Determines what is a passing read
dedup=TRUE, # Removes PCR duplicates
pe="both" # Requires both reads of a pair to be aligned
)
)
Since this is a single column of data, let’s rename it:
colnames(whole_genome) <-c("small_data")
annotated_regions <- get_annotated_regions_from_gff(file.path(getwd(), "genes.gff"))
Now that we have the windows of high expression, we want to see if any of them overlap with annotated regions.
library(IRanges)
library(SummarizedExperiment)
Find the overlaps between the windows we made
windows_in_genes <- IRanges::overlapsAny(
SummarizedExperiment::rowRanges(whole_genome), # Returns a GRanges object of the windows
annotated_regions
)
windows_in_genes
Here we subset the whole_genome object into annotated and non-annotated regions
annotated_window_counts <- whole_genome[windows_in_genes,]
non_annotated_window_counts <- whole_genome[!windows_in_genes,]
Use assay() to extract the actual counts from the object
assay(non_annotated_window_counts)
In this step, we use the Rsamtools pileup() function to get a per-base coverage data frame. Each row represents a single nucleotide in the reference, and the count column gives the depth of coverage at that point.
library(bumphunter)
pile_df <- Rsamtools::pileup(file.path(getwd(), "windows.bam"))
This step groups reads within a certain distance of each other into clusters; we give it the sequence names, positions, and a maximum gap.
clusters <- bumphunter::clusterMaker(pile_df$seqnames, pile_df$pos, maxGap=100)
table(clusters)
In this step, we find regions of elevated coverage from the clusters we made
bumphunter::regionFinder(pile_df$count, pile_df$seqnames, pile_df$pos, clusters, cutoff=1)
Let’s load the required packages
library(ggplot2)
library(ggtree)
library(treeio)
First, we need to load our raw tree data. It's in Newick format, so we use:
itol <- ape::read.tree("itol.nwk")
Now we will print out a very basic phylogenetic tree
ggtree(itol)
We can also change the format to make it a circular tree
ggtree(itol, layout = "circular")
We can also change the left-right/up-down direction
ggtree(itol) + coord_flip() + scale_x_reverse()
By using geom_tiplab() we can add names to the end of the tips
ggtree(itol) + geom_tiplab(color = "indianred", size =1.5)
By adding a geom_strip() layer, we can annotate clades in the tree with a block of color
ggtree(itol, layout = "unrooted") + geom_strip(13, 14, color = "red", barsize =1)
We can highlight clades in unrooted trees with blobs of color using geom_hilight
ggtree(itol, layout = "unrooted") + geom_hilight(node=11, type="encircle", fill ="steelblue")
Let’s install the necessary packages!
install.packages('BAMMtools')
library(BAMMtools)
We can use the MRCA function to find the node we want
getmrca(itol, "Photorhabdus_luminescens", "Blochmanmia_floridanus")
Now if we want to highlight the section of the most recent common ancestor between the two above
ggtree(itol, layout ="unrooted")+geom_hilight(node=206, type="encircle", fill="steel blue")
First, let’s load the required packages
library(ape)
library(adegraphics)
library(treespace)
Now we need to load all the treefiles into a multiPhylo object
treefiles <- list.files(file.path(getwd(), "gene_trees"), full.names=TRUE)
tree_list <- lapply(treefiles, read.tree)
class(tree_list)
class(tree_list) <- "multiPhylo"
class(tree_list)
Now we can compute the Kendall-Colijn distances between trees. This function does a LOT of analyses.
First, it runs a pairwise comparison of all trees in the input. Second, it carries out clustering using PCA. The results are returned in a list of objects, where $D contains the pairwise distances between the trees and $pco contains the PCA. The metric we use (Kendall-Colijn) is particularly suitable for rooted trees like the ones we have here. The option nf tells treespace how many principal components to retain.
comparisons <- treespace(tree_list, nf=3)
We can plot the pairwise distances between trees with table.image
table.image(comparisons$D, nclass=25)
Now let's plot the PCA and clusters; this shows us how the trees group into clusters
plotGroves(comparisons$pco, lab.show=TRUE, lab.cex = 1.5)
groves <- findGroves(comparisons, nclust=4)
plotGroves(groves)
Let’s load our required packages!
library(ape)
Now let's load the tree data we will be working with
newick <- read.tree("mammal_tree.nwk.txt")
l <- subtrees(newick)
Let’s plot the tree to see what it looks like
plot(newick)
We can plot any of the subtrees by indexing into the list
plot(l[[4]], sub= "Node 4")
Extract the tree manually
small_tree <- extract.clade(newick, 9)
plot(small_tree)
Now what if we want to bind two trees together
new_tree <- bind.tree(newick, small_tree, 3)
plot(new_tree)
Let's load the necessary packages!
library(Biostrings)
library(msa)
library(phangorn)
First we’ll load the sequences into a seqs variable
seqs <- readAAStringSet("abc.fa")
Now, let’s construct an alignment with the msa package and ClustalOmega
aln <-msa::msa(seqs, method=c("ClustalOmega"))
To create a tree, we need to convert the alignment to a phyDat object
aln <- as.phyDat(aln, type = "AA")
class(aln)
In this step we'll actually make the trees. Trees are made from a distance matrix, which can be computed with dist.ml(); "ml" stands for maximum likelihood
dist_mat <- dist.ml(aln)
Now we pass the distance matrix to upgma to make a UPGMA tree
upgma_tree <- upgma(dist_mat)
plot(upgma_tree, main="UPGMA")
We can instead pass the distance matrix to a neighbor-joining function
nj_tree <-NJ(dist_mat)
plot(nj_tree, main="NJ")
Lastly, we are going to use the bootstrap.phyDat() function to compute bootstrap support for the branches of the tree. The first argument is the alignment (aln), while the FUN argument is the tree-building function.
Bootstraps are essentially confidence values for whether a clade is placed in the correct position
fit <-pml(nj_tree, aln)
bootstraps <- bootstrap.phyDat(aln, FUN=function(x) (NJ(dist.ml(x))), bs=100)
plotBS(nj_tree, bootstraps, p=10)
First, Let’s load the required libraries
library(GenomicRanges)
library(gmapR)
library(rtracklayer)
library(VariantAnnotation)
library(VariantTools)
Now we want to load our datasets. We need .bam and .fa files for this pipeline to work
bam_file <- file.path(getwd(), "hg17_snps.bam")
fasta_file <- file.path(getwd(), "chr17.83k.fa")
Now we need to set up the genome object and the parameters object:
fa <- rtracklayer::FastaFile(fasta_file)
Now we create a GmapGenome object, which describes the genome to the later variant-calling function
genome <- gmapR::GmapGenome(fa, create=TRUE)
This next step sets our parameters for what is considered a variant. There can be a lot of fine-tuning to this function. We are just going to use the standard settings.
qual_params <- TallyVariantsParam(
genome=genome,
minimum_mapq=20)
var_params <- VariantCallingFilters(read.count=19, p.lower=0.01)
Now we use the callVariants function with the parameters we defined above
called_variants <- callVariants(bam_file,
qual_params,
calling.filter = var_params)
head(called_variants)
We have identified 6 variants
Now we move on to annotation and load the feature position information from the GFF
get_annotated_regions_from_gff <-function(file_name) {
gff <- rtracklayer::import.gff(file_name)
as(gff, "GRanges")
}
Note that you can also load this data from a bed file.
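A sketch of the equivalent helper for BED input, using rtracklayer::import.bed() (import.bed already returns a GRanges, so the coercion is just for symmetry with the GFF version):

```{r}
get_annotated_regions_from_bed <- function(file_name) {
  bed <- rtracklayer::import.bed(file_name)
  as(bed, "GRanges")
}
```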
genes <-get_annotated_regions_from_gff("chr17.83k.gff3")
Now we can calculate which variants overlap which genes
overlaps <- GenomicRanges::findOverlaps(called_variants, genes)
overlaps
And lastly, we subset the genes with the list of overlaps
genes[subjectHits(overlaps)]
First things first, let's load the required packages
library(Biostrings)
library(systemPipeR)
Let's load the data into a DNAStringSet object; in this case, an Arabidopsis chloroplast genome
dna_object <- readDNAStringSet("arabidopsis_chloroplast.fa.txt")
Now, let's predict the open reading frames with predORF(); we'll predict all ORFs on both strands
predict_orfs <- predORF(dna_object, n='all', type='gr', mode='ORF', strand='both',
longest_disjoint=TRUE)
predict_orfs
This returned a GRanges object with 2,501 open reading frames. This is FAR too many open reading frames.
To filter out erroneous ORFs, we do a simulation. We first estimate the length an ORF can reach by chance: we will create a string of random nucleotides the length of our chloroplast genome and determine the longest ORF that can arise by chance.
bases <- c("A", "T", "G", "C")
raw_seq_string <- strsplit(as.character(dna_object), "")
Now we need to ensure that our random nucleotides match the proportion of nucleotides in our chloroplast genome so we have no bias
seq_length <- width(dna_object[1])
counts <-lapply(bases, function(x) {sum(grepl(x, raw_seq_string[[1]]))})
probs <- unlist(lapply(counts, function(base_count){signif(base_count/seq_length, 2)}))
probs
Now we can build our function to simulate a genome
get_longest_orf_in_random_genome <- function(x,
                                             length = 1000,
                                             probs = c(0.25, 0.25, 0.25, 0.25),
                                             bases = c("A", "T", "G", "C")) {
  # Create our random genome, sampling bases with replacement at the given probabilities
  random_genome <- paste0(sample(bases, size = length, replace = TRUE, prob = probs), collapse = "")
  random_dna_object <- DNAStringSet(random_genome)
  names(random_dna_object) <- c("random_dna_string")
  orfs <- predORF(random_dna_object, n = 1, type = 'gr', mode = 'ORF', strand = 'both', longest_disjoint = TRUE)
  return(max(width(orfs)))
}
Now we use the function we just created to predict the longest ORFs in 10 random genomes
random_lengths <- unlist(lapply(1:10, get_longest_orf_in_random_genome, length=seq_length, probs=probs, bases=bases))
Let's pull the longest length from our 10 simulations
longest_random_orf <-max(random_lengths)
Let's only keep the frames in our chloroplast genome that are longer than this
keep <- width(predict_orfs) > longest_random_orf
orfs_to_keep <- predict_orfs[keep]
orfs_to_keep
Write this data to file
extracted_orfs <- BSgenome::getSeq(dna_object, orfs_to_keep)
names(extracted_orfs) <- paste0("orf_", 1:length(orfs_to_keep))
writeXStringSet(extracted_orfs, "saved_orfs.fa")
First, let’s load the required packages
library(karyoploteR)
library(GenomicRanges)
Now we need to set up the genome object for our karyotype
genome_df <-data.frame(
# First, we dictate the number of chromosomes
chr= paste0("chr", 1:5),
start=rep(1, 5),
# And then we will dictate the length of each chromosome
end = c(34964571, 22037565, 25499034, 20862711, 31270811)
)
Now we convert the dataframe to a granges object
genome_gr <-makeGRangesFromDataFrame(genome_df)
Now let’s create some snp positions to map out. We do this by using the sample() function
snp_pos <-sample(1:1e7, 25)
snps <- data.frame(
chr=paste0("chr", sample(1:5, 25, replace=TRUE)),
start=snp_pos,
end=snp_pos
)
Again we convert the dataframe to granges
snps_gr <- makeGRangesFromDataFrame(snps)
Now let's create SNP labels
snp_labels <- paste0("snp_", 1:25)
Here we will fetch the default parameters for our plot
plot.params <-getDefaultPlotParams(plot.type=1)
Here we will set the margins of our plot
plot.params$data1outmargin <- 600
Now let’s plot our snps
kp <- plotKaryotype(genome = genome_gr, plot.type = 1, plot.params = plot.params)
kpPlotMarkers(kp, snps_gr, labels=snp_labels)
We can also add some numeric data to our plots. We will generate 100 random numbers and plot them in 100 windows on chromosome 4
numeric_data <- data.frame(
y = rnorm(100, mean = 1, sd = 0.5),
chr = rep("chr4", 100),
start = seq(1, 20862711, 20862711/100),
end = seq(1, 20862711, 20862711/100)
)
Now let’s make the data a granges object
numeric_data_gr <- makeGRangesFromDataFrame(numeric_data)
Again, let's set our plot parameters
plot.params<-getDefaultPlotParams(plot.type = 2)
plot.params$data1outmargin<-800
plot.params$data2outmargin<-800
plot.params$topmargin<-800
Let’s plot the data
kp <- plotKaryotype(genome = genome_gr, plot.type = 2, plot.params = plot.params)
kpPlotMarkers(kp, snps_gr, labels=snp_labels)
kpLines(kp, numeric_data_gr, y=numeric_data$y, data.panel=2)