This is a tutorial on how to use R Markdown for reproducible research.
Here we can type long passages or descriptions of our data without needing to "hash out" our comments with the # symbol. In our first example, we will be using the ToothGrowth dataset. In this experiment, guinea pigs (literally) were given different amounts of vitamin C to see the effects on the animals' tooth growth.
To run R code in a markdown file, we need to denote the section that is considered R code. We call these sections "code chunks."
Below is a code chunk:
Toothdata <- ToothGrowth
head(Toothdata)
As you can see from clicking the "play" button on the code chunk, the results are printed inline in the R Markdown file.
```{r, fig.cap="Figure 1: The tooth growth of guinea pigs when given variable amounts of vitamin C"}
fit <- lm(len ~ dose, data = Toothdata)
b <- fit$coefficients
plot(len ~ dose, data = Toothdata)
abline(fit)
```

The slope of the regression line is `r b[2]`.
## Section Headers
We can also put sections and subsections in our R Markdown file, similar to heading levels in a Word document. This is done with the "#" symbol that we previously used to denote comments in an R script.
# First level header
## Second level header
### Third level header
Make sure that you put a space after the hash symbol, otherwise it will not work!
We can also add bullet point-type marks in our R Markdown file.
- One item
- One item
- One item
- One more item
- One more item
- One more item
- One last item
It is important to note here that in R Markdown, indentation matters!!
1. First Item
2. Second Item
3. Third Item
    a. Subitem One
    b. Subitem Two
    c. Subitem Three
## Block Quotes
We can put really nice quotes into the markdown document. We do this by using the ">" symbol.
> "Genes are like the story, and DNA is the language that the story is written in."
>
>--- Sam Kean
## Hyperlinks
Hyperlinks can also be incorporated into these files. This is especially useful in HTML output, since it is viewed in a web browser and can redirect the reader to the material that you are interested in showing them. Here we will use the link to R Markdown's homepage as an example.
[Markdown](http://rmarkdown.rstudio.com/)
## Formulas
We can also put nice formatted formulas into Markdown using two dollar signs.
Hardy-Weinberg formula
$$p^2 + 2pq + q^2 = 1$$
And you can get really complex as well!
$$\Theta = \begin{pmatrix}\alpha & \beta\\
\gamma & \delta
\end{pmatrix}$$
## Code Chunks {.tabset}
### Code Chunk Options
```{r}
print("Hello World")
```

There are also options that control how knitr interprets the code chunk. The following options are available:
- eval (T or F): whether or not to evaluate the code chunk.
- echo (T or F): whether or not to show the code for the chunk; the results will still print.
- cache: if enabled, the same code chunk will not be re-evaluated the next time knitr is run. Great for code that has LONG runtimes.
- fig.width and fig.height: the (graphical device) size of the R plots in inches. The figures are first written to the knitr document, then to files that are saved separately.
- out.width and out.height: the output size of the R plots in the output document.
- fig.cap: the text for the figure caption.
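For example, a chunk that hides its code but still evaluates it, caches the result, and sets a figure size and caption might look like this (a minimal sketch; the options shown are just illustrative choices):

```{r, eval=TRUE, echo=FALSE, cache=TRUE, fig.width=6, fig.height=4, fig.cap="Tooth length by dose"}
plot(len ~ dose, data = ToothGrowth)
```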
We can also add a table of contents to our HTML Document. We do this by altering the YAML code (the weird code chunk at the VERY top of the document.) We can add this:
title: "HTML_Tutorial"
author: "Gwyneth"
date: "5/9/2022"
output:
html_document:
toc: true
toc_float: true
This will give us a very nice floating table of contents on the right hand side of the document.
You can also add TABS to your report. To do this, you need to specify each section that you want to become a tab by placing "{.tabset}" after the header line. Every subsequent lower-level header will be a new tab, as sketched below.
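For instance, a tabbed section might look like this (the header names here are just placeholders):

## Results {.tabset}

### Plots

Content for the first tab goes here.

### Tables

Content for the second tab goes here.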
You can also add themes to your HTML document that change the highlighting color and hyperlink color of your HTML output. This can be nice aesthetically. To do this, you change the theme in your YAML to one of the following:
cerulean, journal, flatly, readable, spacelab, united, cosmo, lumen, paper, sandstone, simplex, yeti, or null
You can also change the syntax highlighting color by specifying highlight:
default, tango, pygments, kate, monochrome, espresso, zenburn, haddock, or textmate
You can also use the code_folding option to allow the reader to toggle between displaying the code and hiding the code. This is done with:
code_folding: hide
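As with toc, this option goes under html_document in the YAML header; a minimal sketch:

output:
  html_document:
    code_folding: hide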
There are a ton of options and ways for you to customize your report using HTML output. This is also a great way to display a "portfolio" of your work if you are trying to market yourself to interested parties.
library(tidyverse)
??flights
my_data <- nycflights13::flights
head(my_data)
First we will just look at the data on October 14th
filter(my_data, month == 10,day ==14)
If we want to subset this into a new variable, we do the following:
oct_14_flight <- filter(my_data, month == 10, day ==14)
What if you want to do both print and save the variable?
(oct_14_flight2 <- filter(my_data, month == 10, day ==14))
If you want to filter based on different comparisons, you can use the following operators: equal to ==, not equal to !=, greater than >, less than <, greater than or equal to >=, and less than or equal to <=.
(flights_through_september <- filter(my_data, month < 10))
If we use = instead of == to mean equals, we get an error:
(oct_14_flight2 <- filter(my_data, month = 10, day = 14))
You can also use logical operators to be more selective
And (&) Or (|) Not (!)
Let's use the "or" operator to pick flights in March or April
March_April_Flights <- filter(my_data, month ==3 | month ==4)
March_4th_Flights <- filter(my_data, month ==3 & day ==4)
Non_Jan_Flights <- filter(my_data, month !=1)
Arrange allows us to arrange the dataset based on the variables we desire.
arrange(my_data, year, day, month)
We can also do this in descending fashion
descending <-arrange(my_data, desc(year), desc(day), desc(month))
Missing values are always placed at the end of the dataframe regardless of ascending or descending.
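A quick sketch with a throwaway tibble shows this (the column x here is made up purely for illustration):

```{r}
na_demo <- tibble(x = c(5, 2, NA, 9))
arrange(na_demo, x)       # NA sorts to the bottom
arrange(na_demo, desc(x)) # NA is still at the bottom
```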
We can also select specific columns that we want to look at.
calendar <- select(my_data, year, month, day)
print(calendar)
We can also look at a range of columns
calendar2 <- select(my_data, year:day)
Let's look at all columns year through carrier
calendar3 <- select(my_data, year:carrier)
We can also choose which columns not to include
Everything_else <- select(my_data, -(year:day))
In this instance we can also use the “not” operator !
Everything_else2 <- select(my_data, !(year:day))
There are also some other helper functions that can help you select the columns you're looking for: starts_with("xyz") will select columns whose names start with "xyz"; ends_with("xyz") will select columns that end with "xyz"; contains("xyz") will select columns that contain "xyz"; matches("xyz") will select columns whose names match the regular expression "xyz".
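A few examples against the flights data (the "dep" and "time" prefixes/suffixes really do appear in this dataset):

```{r}
select(my_data, starts_with("dep"))  # dep_time, dep_delay
select(my_data, ends_with("time"))   # dep_time, arr_time, air_time, ...
select(my_data, contains("arr"))     # arrival columns (note: also matches "carrier")
```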
Here, we see that we can rename the data:
head(my_data)
rename(my_data, departure_time = dep_time)
my_data2 <- rename(my_data, departure_time = dep_time)
What if you want to add new columns to your data frame? We have the mutate() function for that.
First, let's make a smaller data frame so we can see what we're doing
my_data_small <- select(my_data, year:day, distance, air_time)
Let’s calculate the speed of the flights
mutate(my_data_small, speed = distance / air_time *60)
my_data_small <-mutate(my_data_small, speed = distance / air_time *60)
What if we want to create a new dataframe with ONLY our calculations? That is what transmute() is for:
airspeed <- transmute(my_data_small, speed = distance/air_time* 60, speed2=distance / air_time)
We can use summarize to run a function on a data column and get back a single summary value
summarize(my_data, delay = mean(dep_delay, na.rm =TRUE))
We gain additional value from summarize by pairing it with group_by()
by_day <- group_by(my_data, year, month, day)
summarize(by_day, delay = mean(dep_delay, na.rm=TRUE))
As you can see, we now have the delay by the days of the year
What happens if we don’t tell R what to do with the missing data?
summarize(by_day, delay = mean(dep_delay))
We can filter our data based on NA (which in this dataset was cancelled flights)
not_cancelled <- filter(my_data, !is.na(dep_delay), !is.na(arr_delay))
Let’s run summarize again on this data
summarize(not_cancelled, delay = mean(dep_delay))
We can count the number of values that are NA
sum(is.na(my_data$dep_delay))
We can also count the number that are NOT NA
sum(!is.na(my_data$dep_delay))
With tibble datasets, we can pipe results, which gets rid of the need for the dollar sign. We can then summarize the mean departure time by day:
my_data %>%
group_by(year, month, day) %>%
summarize(mean = mean(dep_time, na.rm =TRUE))
Now we will take the time to explore tibbles!
Tibbles are modified data frames that tweak some of the older behaviors of data frames. R is an old language, and things that were useful 20 years ago are not always as useful anymore.
as_tibble(iris)
As we can see, we have the same dataframe, but we have different features.
You can also create a tibble from scratch with tibble()
tibble(
x=1:5,
y=1,
z=x^2+y
)
You can also use tribble() for basic data table creation
tribble(
~genea, ~geneb, ~genec,
#######################
110, 112, 114,
6, 5, 4,
)
Tibbles are built to not overwhelm your console when printing data, only showing the first few lines
This is how a dataframe prints:
print(by_day)
as.data.frame(by_day)
head(by_day)
nycflights13::flights %>%
print(n=10, width=Inf)
Subsetting tibbles is easy, and is very similar to data frames
df_tibble <-tibble(nycflights13::flights)
df_tibble
We can subset by column name using the $:

```{r}
df_tibble$carrier
```

We can also subset by position using [[ ]]:

```{r}
df_tibble[[2]]
```
If you want to use this in a pipe, you need to use the “.” placeholder
df_tibble %>%
.$carrier
Some older functions do not like tibbles. Thus, you might have to convert them back to dataframes
class(df_tibble)
df_tibble_2 <- as.data.frame(df_tibble)
class(df_tibble_2)
df_tibble
head(df_tibble_2)
library(tidyverse)
How do we make a tidy dataset? Well, the tidyverse follows three rules:

1. Each variable must have its own column.
2. Each observation has its own row.
3. Each value has its own cell.
It is impossible to satisfy only two of the three rules. This leads to the following instructions for tidy data:

1. Put each dataset into a tibble.
2. Put each variable into a column.
3. Profit!
Picking one consistent method of data storage makes for easier understanding of your code and what is happening "under the hood," or behind the scenes.
Let’s now look at working with tibbles!
bmi <- tibble(women)
bmi %>%
mutate(bmi = (703 * weight)/(height)^2)
Sometimes you’ll find datasets that don’t fit well into a tibble. We’ll use the built-in data from tidyverse for this part.
table4a
As you can see from this data, we have one variable in column A (country), but columns B and C are two instances of the same variable (the year). Thus, there are two observations in each row.
To fix this, we can use the gather function
table4a %>%
gather("1999", "2000", key="year", value="cases")
Let’s look at another example:
table4b
As you can see, we have the same problem in table4b:
table4b %>%
gather("1999", "2000", key="year", value="population")
Now what if we want to join these two tables? We can use dplyr!
table4a <-table4a %>%
gather("1999", "2000", key="year", value="cases")
table4b <-table4b %>%
gather("1999", "2000", key="year", value="population")
left_join(table4a, table4b)
Spreading is the opposite of gathering! Let’s look at table2
print(table2)
You can see that we have redundant info in columns 1 and 2! We can fix that by combining rows 1 and 2, 3 and 4, etc.
spread(table2, key=type, value=count)
type is the key: what we are turning into columns; the value is what becomes the observations. In summary, spread makes long tables shorter and wider; gather makes wide tables narrower and longer.
Now what happens when we have two values stuck in one column?
print(table3)
As you can see, the rate column is just cases and population combined. We can use separate to fix this:
table3 %>%
separate(rate, into=c("cases", "population"))
However, if you notice, the column type is not correct
table3 %>%
separate(rate, into=c("cases", "population"), convert=TRUE)
You can specify what you want to separate based on.
table3 %>%
separate(rate, into =c("cases", "population"), sep= "/", convert=TRUE)
You can also separate based on character position; here we split year into century and year:
table3 %>%
separate(
year,
into=c("century", "year"),
convert=TRUE,
sep= 2
)
What happens if we want to do the inverse of separate? We use unite.
table5
table5 %>%
unite(date, century, year)
table5 %>%
unite(date, century, year, sep= "")
There can be two types of missing values: NA (explicit) or simply no entry at all (implicit)
gene_data <- tibble(
gene= c("a", "a", "a", "a", "b", "b", "b"),
nuc =c(20, 22, 24, 25, NA, 42, 67),
run =c(1,2,3,4,2,3,4)
)
gene_data
The nucleotide count for gene b, run 2 is explicitly missing. The nucleotide count for gene b, run 1 is implicitly missing.
One way we can make implicit missing values explicit is by putting genes in columns
gene_data %>%
spread(gene, nuc)
If we want to remove the missing values, we can use spread and gather, and na.rm = TRUE
gene_data %>%
spread(gene, nuc) %>%
gather(gene, nuc, "a":"b", na.rm = TRUE)
Another way that we can make implicit values explicit is complete()
gene_data %>%
complete(gene, run)
Sometimes an NA is present to represent a value being carried forward
treatment <- tribble(
~person, ~treatment, ~response,
#################################################
"Issac", 1, 7,
NA, 2, 10,
NA, 3, 9,
"VBD", 1, 8,
NA, 2, 11,
NA, 3, 10,
)
treatment
What we can do here is use the fill() option
treatment%>%
fill(person)
It is rare that you will be working with a single data table. The dplyr package allows you to join two data tables based on common values!
There are three families of join operations (see the sketch after this list):

- Mutating joins add new variables to one data frame from the matching observations in another.
- Filtering joins filter observations from one data frame based on whether or not they are present in another.
- Set operations treat observations as if they were set elements.
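As a sketch of a filtering join (semi_join() keeps rows of the first table that have a match in the second; anti_join() keeps the rows that do not):

```{r}
library(dplyr)
library(nycflights13)
# Flights whose carrier appears in the airlines table
semi_join(flights, airlines, by = "carrier")
# Flights whose carrier is missing from the airlines table
anti_join(flights, airlines, by = "carrier")
```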
library(tidyverse)
library(nycflights13)
Let’s pull full carrier names based on letter codes
airlines
Let’s get info about airports
airports
Let’s get info about each plane
planes
Let’s get some info on the weather at the airports
weather
Let’s get info on singular flights
flights
Let's look at how these tables connect:

- flights connects to planes via tailnum
- flights connects to airlines through carrier
- flights connects to airports via origin and dest
- flights connects to weather via origin and year/month/day/hour
Keys are unique identifiers for observations. A primary key uniquely identifies an observation in its own table.
One way to identify a primary key is as follows:
planes %>%
count(tailnum) %>%
filter(n>1)
No rows come back, which indicates that each tail number is unique!
planes %>%
count(model) %>%
filter(n>1)
flights2 <- flights %>%
select(year:day, hour, origin, dest, tailnum, carrier)
flights2
flights2 %>%
select(-origin, -dest) %>%
left_join(airlines, by = "carrier")
We’ve now added the airline name to our data frame from the airline data frame!
Inner joins (inner_join) match a pair of observations when their keys are equal. Outer joins (left_join, right_join, full_join) keep observations that appear in at least one table.
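A minimal sketch of the difference, using two toy tibbles invented purely for illustration:

```{r}
x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y4"))
inner_join(x, y, by = "key") # keys 1 and 2 only
full_join(x, y, by = "key")  # keys 1, 2, 3, and 4, with NAs where there is no match
```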
library(tidyverse)
library(stringr)
You can create strings using single or double quotes!
string1 <- "this is a string"
string2 <- 'To put a "quote" in your string, use the opposite'
string1
string2
If you forget to close your string, the console will wait with a + continuation prompt; just hit Escape and try again. Here the string is properly closed:
string3 <- "Where is this going?"
string3
Multiple strings are stored in character vectors!
string4 <- c("one", "two", "three")
string4
Measuring string length:
str_length(string3)
str_length(string4)
Let’s combine two strings!
str_c("x", "y")
str_c(string1, string2)
You can use sep to control how they are separated!
str_c(string1, string2, sep = " ")
str_c("x", "y", "z", sep = "_")
You can subset a string using str_sub()
HSP <- c("HSP123", "HSP234", "HSP456")
str_sub(HSP, 4,6)
This extracts characters 4 through 6, dropping the first three letters of each string!
Or you can use negatives to count back from the end
str_sub(HSP, -3, -1)
You can convert the cases of strings like follows:
HSP
str_to_lower(HSP)
str_to_upper(HSP)
install.packages("htmlwidgets")
x <- c('ATTAGA', 'CGCCCCCCGGAT', 'TATTA')
str_view(x, "G")
str_view(x, "TA")
The next metacharacter is ".", which matches any single character
str_view(x, ".G.")
Anchors allow you to match at the start or the end of a string
str_view(x, "^TA")
str_view(x, "TA$")
Character classes give more alternatives: \d matches any digit, \s matches any whitespace, and [abc] matches a, b, or c
str_view(x, "TA[GT]")
[^abc] matches anything BUT a, b, or c
str_view(x, "TA[^T]")
You can also use | to pick between two alternatives
str_view(x, "TA[G|T]")
str_detect() returns a logical vector the same length as its input
y <-c("apple", "banana", "pear")
y
str_detect(y, "e")
How many common words contain the letter e?
words
sum(str_detect(words, "e"))
Let's get more complex: what proportion of words ends in a vowel? What proportion starts with one?
mean(str_detect(words, "[aeiou]$"))
mean(str_detect(words, "^[aeiou]"))
Let’s find all the words that don’t contain “o” or “u”
no_o <- !str_detect(words, "[ou]")
no_o
Now let’s extract!
words[!str_detect(words, "[ou]")]
You can also use str_count() to count how many matches there are in a string
x
str_count(x, "[GC]")
Let’s couple this with mutate:
df <- tibble(
word = words,
count = seq_along(word)
)
df
df %>%
mutate(
vowels = str_count(words, "[aeiou]"),
consonants = str_count(words, "[^aeiou]")
)
First, we need to load the necessary packages!
setwd("C:~/Desktop/classroom/myfiles/Bric16_Targets.csv")
#setwd("~/Desktop/classroom/myfiles")
library(ath1121501cdf)
library(ath1121501.db)
library(annotate)
library(affy)
library(limma)
library(oligo)
library(dplyr)
library(stats)
library(reshape)
Read the CEL files from the directory
targets <- readTargets("Bric16_Targets.csv", sep= ",", row.names="filename")
ab <-ReadAffy()
hist(ab)
Normalizing the microarray experiments:
eset <-affy::rma(ab)
pData(eset)
Let’s visualize the normalized data!
hist(eset)
Let’s characterize the data a bit!
ID <-featureNames(eset)
Symbol <-getSYMBOL(ID, "ath1121501.db")
affydata <-read.delim("affy_ATH1_array_elements.txt")
Flight vs. Ground
condition <- targets$gravity
design <-model.matrix(~factor(condition))
colnames(design) <-c("flight", "ground")
fit <-lmFit(eset, design)
fit <-eBayes(fit)
options(digits=2)
top <- topTable(fit, coef=2, n=Inf, adjust='fdr')
Let’s combine annotations!
Annot <- data.frame(GENE = sapply(contents(ath1121501GENENAME), paste, collapse = ", "),
KEGG = sapply(contents(ath1121501PATH), paste, collapse = ", "),
GO = sapply(contents(ath1121501GO), paste, collapse = ", "),
SYMBOL = sapply(contents(ath1121501SYMBOL), paste, collapse = ", "),
ACCNUM = sapply(contents(ath1121501ACCNUM), paste, collapse = ", "))
Let’s merge all the data into one dataframe
all <-merge(Annot, top, by.x=0, by.y=0, all=TRUE)
all2 <-merge(all, affydata, by.x ="Row.names", by.y ="array_element_name")
Let’s convert the ACCNUM to a character value
all2$ACCNUM <-as.character(all2$ACCNUM)
write.table(all2, file="BRIC_16_Final.csv", row.names=TRUE, sep="\t")
library(org.At.tair.db) # needed for the TAIR annotation mapping below
columns(org.At.tair.db)
foldchanges <-as.data.frame(all$logFC)
all2$entrez=mapIds(org.At.tair.db,
keys=all2$ACCNUM,
column="ENTREZID",
keytype ="TAIR",
multiVals = "first")
head(all2, 10)
library(pathview)
library(gage)
library(gageData)
data(kegg.sets.hs)
foldchanges=all2$logFC
names(foldchanges) = all2$entrez
head(foldchanges)
kegg.ath=kegg.gsets("ath", id.type="entrez")
kegg.ath.sigmet = kegg.ath$kg.sets[kegg.ath$sigmet.idx]
Let’s get the results!
keggres = gage(foldchanges, gsets=kegg.ath.sigmet, same.dir=TRUE)
lapply(keggres, head)
Let’s find the greaters and lessers!
greater <- keggres$greater
lessers <- keggres$less
write.csv(greater, file="BRIC_16_pathview_Greater_Pathways.csv")
write.csv(lessers, file="BRIC_16_pathview_Lesser_Pathways.csv")
keggrespathways = data.frame(id=rownames(keggres$greater), keggres$greater) %>%
tbl_df() %>%
filter(row_number() <=5) %>%
.$id %>%
as.character()
keggrespathways
keggresids_greater=substr(keggrespathways, start=1, stop=8)
keggresids_greater
genedata <-as.character(all$logFC)
Define a plotting function to apply next!
plot_pathway = function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species="ath", new.signature=FALSE)
Plot multiple pathways! (The plots are saved to disk; the call returns a throwaway list object.)
tmp=sapply(keggresids_greater, function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species="ath"))
keggrespathways = data.frame(id=rownames(keggres$less), keggres$less) %>%
tbl_df() %>%
filter(row_number() <=5) %>%
.$id %>%
as.character()
keggrespathways
keggresids_lessers = substr(keggrespathways, start=1, stop=8)
keggresids_lessers
genedata <-as.character(all$logFC)
Define a plotting function to apply next!
plot_pathway = function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species="ath", new.signature=FALSE)
Plot multiple pathways (the plots are saved to disk; the call returns a throwaway list object)
tmp=sapply(keggresids_lessers, function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species="ath"))
Let’s load the necessary packages for EdgeR!
library("edgeR")
library("dplyr")
library("AnnotationDbi")
library("org.Mm.eg.db")
Let's analyze our raw data!
rawdata=read.csv("~/Desktop/classroom/myfiles/GLDS-102_rna_seq_Normalized_Counts.csv", row.names="gene_id")
group <-factor(c('1', '1', '1', '1', '1', '1', '2', '2', '2', '2', '2', '2'))
dgeGlm <-DGEList(counts=rawdata, group=group)
str(dgeGlm)
keep <- rowSums(cpm(dgeGlm)>2) >=4
dgeGlm <- dgeGlm[keep,]
names(dgeGlm)
dgeGlm[["samples"]]
design <-model.matrix(~group)
design
dgeGlmComDisp <- estimateGLMCommonDisp(dgeGlm, design, verbose=TRUE)
dgeGlmTrendDisp <- estimateGLMTrendedDisp(dgeGlmComDisp, design)
dgeGlmTagDisp <-estimateGLMTagwiseDisp(dgeGlmTrendDisp, design)
plotBCV(dgeGlmTagDisp)
fit <- glmFit(dgeGlmTagDisp, design)
colnames(coef(fit))
lrt <-glmLRT(fit, coef=2)
ttGlm <-topTags(lrt, n=Inf)
class(ttGlm)
Let’s summarize the data!
summary(deGlm <- decideTestsDGE(lrt, p=0.1, adjust="fdr"))
tagsGlm <- rownames(dgeGlmTagDisp)[as.logical(deGlm)]
plotSmear(lrt, de.tags=tagsGlm)
hits2 <- ttGlm$table[ttGlm$table$FDR < 0.1,]
write.csv(hits2, "Mouse_EdgeR_Results_Zero_vs_1.csv")
Let’s organize our data!
columns(org.Mm.eg.db)
ttGlm2 <- as.data.frame(ttGlm$table)
ttGlm2$symbol = mapIds(org.Mm.eg.db,
keys=row.names(ttGlm2),
column="SYMBOL",
keytype = "ENSEMBL",
multiVals = "first")
ttGlm2$entrez = mapIds(org.Mm.eg.db,
keys=row.names(ttGlm2),
column="ENTREZID",
keytype = "ENSEMBL",
multiVals = "first")
ttGlm2$name = mapIds(org.Mm.eg.db,
keys=row.names(ttGlm2),
column="GENENAME",
keytype = "ENSEMBL",
multiVals = "first")
head(ttGlm2)
Let’s load the packages for our pathways!
library(pathview)
library(gage)
library(gageData)
data(kegg.sets.mm)
data(go.subs.mm)
foldchanges <- ttGlm2$logFC
names(foldchanges) <-ttGlm2$entrez
kegg.mm = kegg.gsets("mouse", id.type = "entrez")
kegg.mm.sigmet <-kegg.mm$kg.sets[kegg.mm$sigmet.idx]
Let’s get the results!
keggres=gage(foldchanges, gsets = kegg.mm.sigmet, same.dir =TRUE)
lapply(keggres, head)
greaters <- keggres$greater
lessers <- keggres$less
Let’s explore our pathways (greater)!
keggrespathways = data.frame(id=rownames(keggres$greater), keggres$greater) %>%
tbl_df() %>%
filter(row_number() <=5) %>%
.$id %>%
as.character()
keggrespathways
keggresids=substr(keggrespathways, start=1, stop=8)
keggresids
Let’s plot our pathway!
plot_pathway=function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species = "mouse", new.signature=FALSE)
# Plot multiple pathways
tmp=sapply(keggresids, function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species="mouse"))
Let’s explore the pathways (lesser)!
keggrespathways = data.frame(id=rownames(keggres$less), keggres$less) %>%
tbl_df() %>%
filter(row_number() <=5) %>%
.$id %>%
as.character()
keggrespathways
keggresids=substr(keggrespathways, start=1, stop=8)
keggresids
Let’s plot our pathways (lesser)!
plot_pathway=function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species = "mouse", new.signature=FALSE)
Plot multiple pathways!
tmp=sapply(keggresids, function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species="mouse"))
Let's gather the saved plot files!
library(imager)
filenames <- list.files(path = "E:/Bioinformatics/BISC_450_Scripts/mouse_edgeR", pattern=".*pathview.png")
all_images <-lapply(filenames, load.image)
Let's use the knitr function to include graphics!
knitr::include_graphics(filenames)
Let's load the required libraries for this analysis
BiocManager::install("apeglm", update=FALSE)
library("DESeq2")
library("dplyr")
library("apeglm")
Let's load in our data
countData <-read.csv("~/Desktop/classroom/myfiles/GLDS-102_rna_seq_Unnormalized_Counts.csv")
colData <- read.csv("~/Desktop/classroom/myfiles/PHENO_DATA_Mouse.csv", sep=",", row.names=1)
Now we need to add factor levels to the colData
colData$group <-factor(colData$group, levels = c("0", "1"))
Now let's make sure the sample names in colData match the columns of countData
all(rownames(colData) %in% colnames(countData))
We need to round up the counts, because DESeq2 only allows integers as input, not fractional numbers. Whether this is needed depends on the method that was used upstream.

```{r, results='hide'}
countData %>% mutate_if(is.numeric, ceiling)
```
countData[, 2:13] <- sapply(countData[, 2:13], as.integer)
row.names(countData) <-countData[,1]
countData <- countData[2:13]
rownames(colData) == colnames(countData)
# Build the DESeq2 dataset from the counts and sample data, then run the DESeq pipeline
dds <- DESeqDataSetFromMatrix(countData = countData, colData = colData, design = ~group)
dds <- DESeq(dds)
res <- results(dds)
Let's summarize our data!
```{r}
resLFC <- lfcShrink(dds, coef=2)
(resOrdered <- res[order(res$padj), ])
summary(res)
```
Let’s load our data into a dataframe!
FLT_Vs_GC <- as.data.frame(res$log2FoldChange)
head(FLT_Vs_GC)
Let’s plot our data!
plotMA(resLFC, ylim=c(-2,2))
pdf(file ="MA_Plot_FLT_vs_GC.pdf")
plotMA(resLFC, ylim=c(-2,2))
dev.off()
Let's save our differential expression results to file.
write.csv(as.data.frame(resOrdered), file="Mouse_DESeq.csv")
Let's perform a custom transformation
dds <- estimateSizeFactors(dds)
se <- SummarizedExperiment(log2(counts(dds, normalized=TRUE) +1), colData=colData(dds))
pdf(file="PCA_PLOT_FLT_vs_GC.pdf")
plotPCA(DESeqTransform(se), intgroup="group")
Let's load our required packages
library(AnnotationDbi)
library(org.Mm.eg.db)
Let's add annotations to our results!
columns(org.Mm.eg.db)
foldchanges <-as.data.frame(res$log2FoldChange, row.names=row.names(res))
head(foldchanges)
res$symbol = mapIds(org.Mm.eg.db,
keys=row.names(res),
column="SYMBOL",
keytype="ENSEMBL",
multiVals = "first")
res$entrez = mapIds(org.Mm.eg.db,
keys=row.names(res),
column="ENTREZID",
keytype="ENSEMBL",
multiVals = "first")
res$name = mapIds(org.Mm.eg.db,
keys=row.names(res),
column="GENENAME",
keytype="ENSEMBL",
multiVals = "first")
head(res, 10)
Let's load our pathview packages!
library(pathview)
library(gage)
library(gageData)
data(kegg.sets.mm)
data(go.subs.mm)
foldchanges <- res$log2FoldChange
names(foldchanges) = res$entrez
head(foldchanges)
kegg.mm = kegg.gsets("mouse", id.type= "entrez")
kegg.mm.sigmet <-kegg.mm$kg.sets[kegg.mm$sigmet.idx]
Let's map the results!
keggres <- gage(foldchanges, gsets = kegg.mm.sigmet, same.dir = TRUE)
lapply(keggres, head)
Let's save our greater and lesser pathways
greaters <-keggres$greater
lessers <- keggres$less
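If we want these on disk like in the earlier BRIC-16 analysis, write.csv() works the same way (the file names here are just suggestions):

```{r}
write.csv(greaters, file = "Mouse_DESeq_Greater_Pathways.csv")
write.csv(lessers, file = "Mouse_DESeq_Lesser_Pathways.csv")
```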
Let’s analyze our greater pathways!
keggrespathways <- data.frame(id = rownames(keggres$greater), keggres$greater) %>%
tbl_df() %>%
filter(row_number() <=3) %>%
.$id %>%
as.character()
keggrespathways
keggresids <-substr(keggrespathways, start=1, stop=8)
keggresids
Let's plot our pathways!
plot_pathway=function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species= "mouse", new.signature=FALSE)
tmp = sapply(keggresids, function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species="mouse"))
Let’s format our lesser pathways!
keggrespathways <- data.frame(id = rownames(keggres$less), keggres$less) %>%
tbl_df() %>%
filter(row_number() <=3) %>%
.$id %>%
as.character()
keggrespathways
keggresids <-substr(keggrespathways, start=1, stop=8)
keggresids
Let’s plot our pathways!
plot_pathway=function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species= "mouse", new.signature=FALSE)
tmp = sapply(keggresids, function(pid) pathview(gene.data=foldchanges, pathway.id=pid, species="mouse"))
Let's pull our pathway images into a document!
library(imager)
filenames <-list.files(path= "E:/Bioinformatics/Bisc_450_Scripts/mouse_Deseq", pattern=".*pathview.png")
all_images <- lapply(filenames, load.image)
Let’s use the knitr function to include graphics in our new file!
knitr::include_graphics(filenames)
Let’s load the necessary library for this!
library(tidyverse)
Let’s view the documents we created earlier!
EdgeR <- read.csv("~/Desktop/classroom/myfiles/ged015@latech.edu/Mouse_EdgeR_Results_Zero_vs_1.csv")
DESeq <- read.csv("~/Desktop/classroom/myfiles/ged015@latech.edu/Mouse_DESeq.csv")
DESeq2 <- DESeq %>%
filter(padj < 0.1)
DESeq2 <- DESeq2[,c(1,3)]
EdgeR <- EdgeR[, 1:2]
colnames(DESeq2) <-c("ID", "logFC")
colnames(EdgeR) <-c("ID", "logFC")
Install the "GOplot" package
install.packages("GOplot")
library(GOplot)
Plot and create a table using the previous information!
comp <-GOVenn(DESeq2, EdgeR, label =c("DESeq", "EdgeR"), title="Comparison of DESeq and EdgeR DE Genes", plot=FALSE)
comp$plot
comp$table
Let’s load the library needed for this!
library(msa)
seq <-readAAStringSet("hglobin.fa")
seq
Let’s align the 8 different amino acid sequences
alignments <-msa(seq, method="ClustalW")
alignments
msaPrettyPrint(alignments, output="pdf",
showNames="left", showLogo="none", askForOverwrite = FALSE,
verbose=TRUE, file="whole_align.pdf")
msaPrettyPrint(alignments, c(10, 30), output="pdf", showNames="left",
file = "Zoomed_align.pdf", showLogo = "top", askForOverwrite = FALSE,
verbose=TRUE)
Let’s make a tree from our alignment
library(ape)
library(seqinr)
Convert the seqinr alignment, get the distances, and make a tree
alignment_seqinr <- msaConvert(alignments, type = "seqinr::alignment")
distances1 <- seqinr::dist.alignment(alignment_seqinr, "identity")
tree <- ape::nj(distances1)
plot(tree, main="Phylogenetic Tree of MBA Sequences")
Let’s load the necessary library!
library(DECIPHER)
In the first step, we load the sequences into long_seq. This is a DNAStringSet object.
long_seq <- readDNAStringSet("plastid_genomes.fa")
long_seq
Now let’s build a temporary SQLite database
Seqs2DB(long_seq, "XStringSet", "long_db", names(long_seq))
Now that we have built the database, we can find the syntenic blocks
synteny <- FindSynteny("long_db")
View the blocks with plotting!
pairs(synteny)
plot(synteny)
Make an actual alignment file
alignment <- AlignSynteny(synteny, "long_db")
Let's create a structure holding all aligned syntenic blocks for a pair of sequences
block <- unlist(alignment[[1]])
We can write to file one alignment at a time
writeXStringSet(block, "genome_blocks_out.fa")
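If we wanted every pairwise alignment on disk, a small loop over the alignment list would do it (a sketch; the file naming scheme is arbitrary):

```{r}
for (i in seq_along(alignment)) {
  blocks <- unlist(alignment[[i]]) # all syntenic blocks for this pair of sequences
  writeXStringSet(blocks, paste0("genome_blocks_pair_", i, ".fa"))
}
```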
Let's load the necessary libraries!
library(locfit)
library(Rsamtools)
Let’s create a function that will load the gene region information in a GFF file and convert it to a bioconductor GRanges object
get_annotated_regions_from_gff <- function(file_name) {
gff <- rtracklayer::import.gff(file_name)
as(gff, "GRanges")
}
Get counts in 500 bp windows across the genome
whole_genome <- csaw::windowCounts(
# file.path(~Desktop/classroom/myfiles)
file.path(getwd(), "windows.bam"),
bin=TRUE,
filter=0,
width=500,
param=csaw::readParam(
minq=20, # Determines what is a passing read
dedup=TRUE, # Removes PCR duplicates
pe="both" # Requires both reads of a pair to be aligned
)
)
Since this is a single column of data, let’s rename it:
colnames(whole_genome) <-c("small_data")
annotated_regions <- get_annotated_regions_from_gff(file.path(getwd(), "genes.gff"))
Now that we have the windows of high expression, we want to see if any of them overlap with annotated regions.
library(IRanges)
library(SummarizedExperiment)
Find the overlaps between the windows we made
windows_in_genes <- IRanges::overlapsAny(
SummarizedExperiment::rowRanges(whole_genome), # Returns a GRanges object of the windows
annotated_regions
)
windows_in_genes
Here we subset the whole_genome object into annotated and non-annotated regions
annotated_window_counts <- whole_genome[windows_in_genes,]
non_annotated_window_counts <- whole_genome[!windows_in_genes,]
Use assay() to extract the actual counts from the object
assay(non_annotated_window_counts)
In this step, we use the Rsamtools pileup() function to get a per-base coverage data frame. Each row represents a single nucleotide in the reference, and the count column gives the depth of coverage at that point.
library(bumphunter)
pile_df <- Rsamtools::pileup(file.path(getwd(), "windows.bam"))
This step groups reads within a certain distance of each other into clusters; we give it the sequence names, positions, and a maximum gap.
clusters <- bumphunter::clusterMaker(pile_df$seqnames, pile_df$pos, maxGap=100)
table(clusters)
In this step, we find regions of elevated coverage from the clusters we made
bumphunter::regionFinder(pile_df$count, pile_df$seqnames, pile_df$pos, clusters, cutoff=1)
Let’s load the required packages
library(ggplot2)
library(ggtree)
library(treeio)
First, we need to load our raw tree data. It's in Newick format, so we use:
itol <- ape::read.tree("itol.nwk")
Now we will print out a very basic phylogenetic tree
ggtree(itol)
We can also change the format to make it a circular tree
ggtree(itol, layout = "circular")
We can also change the left-right/up-down direction
ggtree(itol) + coord_flip() + scale_x_reverse()
By using geom_tiplab() we can add names to the end of the tips
ggtree(itol) + geom_tiplab(color = "indianred", size =1.5)
By adding a geom_strip() layer, we can annotate clades in the tree with a block of color
ggtree(itol, layout = "unrooted") + geom_strip(13, 14, color = "red", barsize =1)
We can highlight clades in unrooted trees with blobs of color using geom_hilight
ggtree(itol, layout = "unrooted") + geom_hilight(node=11, type="encircle", fill ="steelblue")
Let’s install the necessary packages!
install.packages('BAMMtools')
library(BAMMtools)
We can use the MRCA function to find the node we want
getmrca(itol, "Photorhabdus_luminescens", "Blochmanmia_floridanus")
Now if we want to highlight the section of the most recent common ancestor between the two above
ggtree(itol, layout ="unrooted")+geom_hilight(node=206, type="encircle", fill="steel blue")
First, let’s load the required packages
library(ape)
library(adegraphics)
library(treespace)
Now we need to load all the treefiles into a multiPhylo object
treefiles <- list.files(file.path(getwd(), "gene_trees"), full.names=TRUE)
tree_list <- lapply(treefiles, read.tree)
class(tree_list)
class(tree_list) <- "multiPhylo"
class(tree_list)
Now we can compute the Kendall-Colijn distances between trees. This function does a LOT of analyses.
First, it runs a pairwise comparison of all trees in the input. Second, it carries out clustering using PCA. The results are returned in a list of objects, where $D contains the pairwise distances between the trees and $pco contains the PCA. The metric we use (Kendall-Colijn) is particularly suitable for rooted trees like the ones we have here. The option nf tells treespace how many principal components to retain.
comparisons <- treespace(tree_list, nf=3)
We can plot the pairwise distances between trees with table.image
table.image(comparisons$D, nclass=25)
Now let's plot the PCA and clusters; this shows us how the trees group into clusters
plotGroves(comparisons$pco, lab.show=TRUE, lab.cex = 1.5)
groves <- findGroves(comparisons, nclust=4)
plotGroves(groves)
Let’s load our required packages!
library(ape)
Now let's load the tree data we will be working with
newick <- read.tree("mammal_tree.nwk.txt")
l <- subtrees(newick)
Let’s plot the tree to see what it looks like
plot(newick)
We can plot any of the subtrees by indexing into the list
plot(l[[4]], sub= "Node 4")
Extract the tree manually
small_tree <- extract.clade(newick, 9)
plot(small_tree)
Now what if we want to bind two trees together
new_tree <- bind.tree(newick, small_tree, 3)
plot(new_tree)
Let's load the necessary packages!
library(Biostrings)
library(msa)
library(phangorn)
First we’ll load the sequences into a seqs variable
seqs <- readAAStringSet("abc.fa")
Now, let’s construct an alignment with the msa package and ClustalOmega
aln <-msa::msa(seqs, method=c("ClustalOmega"))
To create a tree, we need to convert the alignment to a phyDat object
aln <- as.phyDat(aln, type = "AA")
class(aln)
In this step we'll actually make the trees. Trees are made from a distance matrix, which can be computed with dist.ml(); "ml" stands for maximum likelihood
dist_mat <- dist.ml(aln)
Now we pass the distance matrix to upgma to make a UPGMA tree
upgma_tree <- upgma(dist_mat)
plot(upgma_tree, main="UPGMA")
We can instead pass the distance matrix to a neighbor-joining function
nj_tree <-NJ(dist_mat)
plot(nj_tree, main="NJ")
Lastly, we are going to use the bootstrap.phyDat() function to compute bootstrap support for the branches of the tree. The first argument is the alignment (aln), while the FUN argument is the tree-building function.
Bootstraps are essentially confidence values for whether a clade is placed in the correct position
fit <-pml(nj_tree, aln)
bootstraps <- bootstrap.phyDat(aln, FUN=function(x) (NJ(dist.ml(x))), bs=100)
plotBS(nj_tree, bootstraps, p=10)
First, Let’s load the required libraries
library(GenomicRanges)
library(gmapR)
library(rtracklayer)
library(VariantAnnotation)
library(VariantTools)
Now we want to load our datasets. We need .bam and .fa files for this pipeline to work
bam_file <- file.path(getwd(), "hg17_snps.bam")
fasta_file <- file.path(getwd(), "chr17.83k.fa")
Now we need to set up the genome object and the parameters object:
fa <- rtracklayer::FastaFile(fasta_file)
Now we create a GmapGenome object, which describes the genome to the later variant-calling function
genome <- gmapR::GmapGenome(fa, create=TRUE)
This next step sets our parameters for what is considered a variant. There can be a lot of fine-tuning to this function. We are just going to use the standard settings.
qual_params <- TallyVariantsParam(
genome=genome,
minimum_mapq=20)
var_params <- VariantCallingFilters(read.count=19, p.lower=0.01)
Now we use the callVariants function with the parameters we defined above
called_variants <- callVariants(bam_file,
qual_params,
calling.filter = var_params)
head(called_variants)
We have identified 6 variants
Now we move on to annotation and load the feature position information from the GFF
get_annotated_regions_from_gff <-function(file_name) {
gff <- rtracklayer::import.gff(file_name)
as(gff, "GRanges")
}
Note that you can also load this data from a bed file.
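A sketch of the equivalent helper for BED input, using rtracklayer::import.bed() (import.bed already returns a GRanges, so the coercion is just for symmetry with the GFF version):

```{r}
get_annotated_regions_from_bed <- function(file_name) {
  bed <- rtracklayer::import.bed(file_name)
  as(bed, "GRanges")
}
```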
genes <-get_annotated_regions_from_gff("chr17.83k.gff3")
Now we can calculate which variants overlap which genes
overlaps <- GenomicRanges::findOverlaps(called_variants, genes)
overlaps
And lastly, we subset the genes with the list of overlaps
genes[subjectHits(overlaps)]
First things first, let's load the required packages
library(Biostrings)
library(systemPipeR)
Let's load the data into a DNAStringSet object; in this case, an Arabidopsis chloroplast genome
dna_object <- readDNAStringSet("arabidopsis_chloroplast.fa.txt")
Now, let's predict the open reading frames with predORF(); we'll predict all ORFs on both strands
predict_orfs <- predORF(dna_object, n='all', type='gr', mode='ORF', strand='both',
longest_disjoint=TRUE)
predict_orfs
This returned a GRanges object with 2,501 open reading frames. This is FAR too many open reading frames.
To filter out erroneous ORFs, we do a simulation. We first estimate the length an ORF can reach by chance: we will create a string of random nucleotides the length of our chloroplast genome and determine the longest ORF that can arise by chance.
bases <- c("A", "T", "G", "C")
raw_seq_string <- strsplit(as.character(dna_object), "")
Now we need to ensure that our random nucleotides match the proportion of nucleotides in our chloroplast genome so we have no bias
seq_length <- width(dna_object[1])
counts <-lapply(bases, function(x) {sum(grepl(x, raw_seq_string[[1]]))})
probs <- unlist(lapply(counts, function(base_count){signif(base_count/seq_length, 2)}))
probs
Now we can build our function to simulate a genome
get_longest_orf_in_random_genome <- function(x,
                                             length = 1000,
                                             probs = c(0.25, 0.25, 0.25, 0.25),
                                             bases = c("A", "T", "G", "C")) {
  # Create our random genome, sampling bases with replacement at the given probabilities
  random_genome <- paste0(sample(bases, size = length, replace = TRUE, prob = probs), collapse = "")
  random_dna_object <- DNAStringSet(random_genome)
  names(random_dna_object) <- c("random_dna_string")
  orfs <- predORF(random_dna_object, n = 1, type = 'gr', mode = 'ORF', strand = 'both', longest_disjoint = TRUE)
  return(max(width(orfs)))
}
Now we use the function we just created to predict the longest ORFs in 10 random genomes
random_lengths <- unlist(lapply(1:10, get_longest_orf_in_random_genome, length=seq_length, probs=probs, bases=bases))
Let's pull the longest length from our 10 simulations
longest_random_orf <-max(random_lengths)
Let's only keep the frames in our chloroplast genome that are longer than this
keep <- width(predict_orfs) > longest_random_orf
orfs_to_keep <- predict_orfs[keep]
orfs_to_keep
Write this data to file
extracted_orfs <- BSgenome::getSeq(dna_object, orfs_to_keep)
names(extracted_orfs) <- paste0("orf_", 1:length(orfs_to_keep))
writeXStringSet(extracted_orfs, "saved_orfs.fa")
First, let’s load the required packages
library(karyoploteR)
library(GenomicRanges)
Now we need to set up the genome object for our karyotype
genome_df <-data.frame(
# First, we dictate the number of chromosomes
chr= paste0("chr", 1:5),
start=rep(1, 5),
# And then we will dictate the length of each chromosome
end = c(34964571, 22037565, 25499034, 20862711, 31270811)
)
Now we convert the dataframe to a granges object
genome_gr <-makeGRangesFromDataFrame(genome_df)
Now let’s create some snp positions to map out. We do this by using the sample() function
snp_pos <-sample(1:1e7, 25)
snps <- data.frame(
chr=paste0("chr", sample(1:5, 25, replace=TRUE)),
start=snp_pos,
end=snp_pos
)
Again we convert the dataframe to granges
snps_gr <- makeGRangesFromDataFrame(snps)
Now let's create SNP labels
snp_labels <- paste0("snp_", 1:25)
Here we will fetch the default parameters for our plot
plot.params <-getDefaultPlotParams(plot.type=1)
Here we will set the margins of our plot
plot.params$data1outmargin <- 600
Now let’s plot our snps
kp <- plotKaryotype(genome = genome_gr, plot.type = 1, plot.params = plot.params)
kpPlotMarkers(kp, snps_gr, labels=snp_labels)
We can also add some numeric data to our plots. We will generate 100 random numbers and plot them in 100 windows on chromosome 4
numeric_data <- data.frame(
y = rnorm(100, mean = 1, sd = 0.5),
chr = rep("chr4", 100),
start = seq(1, 20862711, 20862711/100),
end = seq(1, 20862711, 20862711/100)
)
Now let’s make the data a granges object
numeric_data_gr <- makeGRangesFromDataFrame(numeric_data)
Again, let's set our plot parameters
plot.params<-getDefaultPlotParams(plot.type = 2)
plot.params$data1outmargin<-800
plot.params$data2outmargin<-800
plot.params$topmargin<-800
Let’s plot the data
kp <- plotKaryotype(genome = genome_gr, plot.type = 2, plot.params = plot.params)
kpPlotMarkers(kp, snps_gr, labels=snp_labels)
kpLines(kp, numeric_data_gr, y=numeric_data$y, data.panel=2)