This is a tutorial on how to use R markdown or reproducible research

Here we can type long passages or descriptions of our data without the need of “hashing” out our comments with the # symbol. In our first example we will be using the ToothGrowth dataset. In this experiment, Guinea Pigs (literal) were given different amount of vitamin C to see the effects on the animal’s tooth growth.

To run R code in a markdown file, we need to denote the section that is considered R code. We call these “code chunks.”

Below is a code chunk:

Toothdata <- ToothGrowth

head(Toothdata)

As you can see from running the “play” button on the code chunk, the results are printed inline of the r markdown file.

fit <- lm(len ~ dose, data = Toothdata)

b <- fit$coefficients

plot(len ~ dose, data = Toothdata)

abline(lm(len ~ dose, data = Toothdata))

Figure 1: The tooth growth of Guinea Pigs when given variable amounts of Vitamin C

The slope of the regression line is 9.7635714 .

Section Headers

We can also put sections and subsections in our r markdown file similar to numbers or bullet points in a word document. This is done with the “#” that we previously used to denote text in an R script.

First level header

Second level header

Third level header

Make sure you put a space after the hastag, otherwise it will not work!

We can also add bullet point-type marks in our R markdown file.

one item
one item
one item
- one more item
- one more item
- one more item
  - one last item

Its important to note here that in R markdown identation matters!

First item
Second item
Third item

subitem 1
subitem 2
subitem 3

Block Quotes

we can put really nice quotes in the markdown document. We can do this by using the “>” symbol.

“Genes are like the story, and DNA is the language that the story in written in.”

—Sam Kean

Hyperlinks

Hyperlinks can also be incorporated into these files. This is especially useful in HTML files, since they are in a web browser and will redirect the reader to the material that you are interested in showing them. Here we will use the link to R markdown’s homepage for this example. Rmarkdown

Formulas

We can also put nice formatted formulas into Markdown by using two dollar signs

Hard-Weinburg Formula

\[p^2 + 2pq + q^2 = 1\]

And you can get really complex as we!!

\[\Theta = \begin{pmatrix}\alpha & \beta\\ \gamma & \delta \end{pmatrix}\]

Code Chunks

Code chunk options

There are aslo options for your R markdown file on how knitr interprets code chunk. There are the following options

Eval (T or F): whether or not to evaluate the code chunk.

Echo (T or F): whether or not show the code for the chunk, but results will still print.

Cache: If enable, the same code chunk will not be evaluated the next time the knitr is run. Great for code that has LONG runtimes.

fig.width or fig.height: the (graphical device) size of the R plots in inches. The figures are first written to the knitr document then to files that are saved separately.

out.width or out.height: The output of the size of the R plots IN THE R DOCUMENT

fig.cap: the words for the figure caption

We can also add a table of contents to our HTML document. We do this by altering the YAML code (the weird code chunk at the VERY top of the document) We can add this:

title: “HTML_Tutorial” author: “Lauren Green” date: “4/11/2022” output: html_document: toc: true toc_float: true

This will give us a very nice floating table of contents on the right hand side of the document.

Tabs

You can also add tabs in our report. To this you need to specify each section that you want to become a tab by placing “{.tabset}” after the line. Every subsequent header will be a new tab.

Themes

You can also add themes to your HTML document that change the highlighting color and hyperlink color of your html output. This can be nice aesthetically. TO do this, you change your theme and your YAML to one of the following:

cerulean journal flatly readable spacelab united cosmo lumen paper sandstone simplex yeti null

You can also change the color by specifing higlight:

default tango payments kate monochrome espresso zenburn haddock textmate

Code Folding

you can also use the code_folding option to allow the reader tot toggle between displaying the code and hiding the code. This is done with:

code_folding: hide

Summary

There are a TON of options and ways for you to customize your R code using the HTML format. This is also a great way to display a “portfolio of your work if you are trying to market yourself to interested parties.

Data Wrangling with R

library(tidyverse)

??flights

my_data <- nycflights13::flights

head(my_data)

First we will just look at the data on the october 14th.

filter(my_data, month == 10, day ==14)

If we want to subset this into a new variable we do the following: oct_14_flight <- filter(my_data, month == 10, day ==14)

what if you want to do both print and save the variable?

(oct_14_flight_2 <- filter(my_data, month == 10, day ==14))

If you want to filter based on different opperators, you can use the following:

Equals == not equal to != greater than > less than < greater than or equal to >= less than or equal to <=

(flight_through_september <- filter(my_data, month < 10))

If we don’t use the == to mean equals, we get this: (oct_14_flight_2 <- filter(my_data, month = 10, day =14))

you can also use logical opperators to be more selective and & or | not !

Lets use the “or” function to pick flights in March and April

March_April_Flights <- filter(my_data, month ==3 | month == 4)

March_4th_Flights <- filter(my_data, month == 3 & day == 4)

Non_jan_flights <- filter(my_data, month != 1)

Arrange

Arrange allows us to arrange the dataset nased on the variables we desire.

arrange(my_data, year, day, month)

we can also do this in descending fashion descending <- arrange(my_data, desc(year), desc(day), desc(month))

Missing values are always placed at the end of the dataframe regardless of ascending or descending

select

we can also select specific columns that we want to look at.

calendar <- select(my_data, year, month, day) print(calendar)

we can also look at a range of columns

calendar2 <- select(my_data, year:day)

lets look at all columns months through carrier calendar3 <- select(my_data, year:carrier)

we can also choose which columns NOT to include

everything_else <- select(my_data, -(year:day))

In this instance we can use the “not” opperator everything_else2 <- select(my_data, !(year:day))

there are also some other helper functions that can help you select the colums or data you’re looking for

starts_with(“xyz”) – will select the values that starts with xyz ends_with(“xyz”) — will select the values that ends with xyz contains(“xyz”) — will select the values that contain xyz matches (“xyz”) — will match the identical value xyz

renaming

head(my_data)

rename(my_data, departure_time = dep_time)

my_data <- rename(my_data, departure_time = dep_time)

mutate

what if you want to add new columns to your data frame? we have the mutate() function for that.

first, lets make smaller data fram so we can see what we’re doing

my_data_small <- select(my_data, year:day, distance, air_time)

lets calculate the speed of the flights. mutate(my_data_small, speed = distance / air_time * 60)

my_data_small <- mutate(my_data_small, speed = distance / air_time * 60)

what if we wanted to create a new dataframe with ONLY your calulations (transmute)

airspeed <- transmute(my_data_small, speed = distance / air_time * 60 , speed2 = distance / air_time)

summarize and by_group()

we can summarize to run a function on a data column to get a single return

summarize(my_data, delay = mean(dep_delay, na.rm = TRUE))

So we can see here that the average delay is about 12 minutes

we gain additional value in summarize by pairing it with by_group()

by_day <- group_by(my_data, year, month, day) summarize(by_day, delay = mean (dep_delay, na.rm = TRUE))

As you can see, we now have the delay by the days of the year

Missing Data

what happens if we don’t tell R what do to with the missing data? summarize(by_day, delay = mean (dep_delay))

we can also filter our data based on NA (which in this dataset was canelled flights)

not_cancelled <- filter (my_data, !is.na(dep_delay), !is.na(arr_delay))

lets run summarize again on this data summarize(not_cancelled, delay = mean(dep_delay))

counts

we can also count the number of variables that are NA

sum(is.na(my_data$dep_delay))

we can also count the numbers that are a NOT NA sum(!is.na(my_data$dep_delay))

Piping

with tibble datasets (more on them soon, we can pipe results to get rid of the need to use the dollar signs) we can then summarize the number of flights by minutes delayed.

my_data %>% group_by(year, month, day) %>% summarize(mean = mean(departure_time, na.rm = TRUE))

Tibbles

library(tibble)

Now we will take the time to explore tibbles. Tibbles are modified dataframes which tweak some

of the older features from data frames. R is an old language and useful things from 20 years

ago are not as useful anymore

as_tibble(iris)

As we can see, we have the same data frame, but we have different features

you can also create a tibble form scratch with tibble(

tibble( x = 1:5, y = 1, z = x ^ 2 + y )

you can also use tribble() for basic data table creation

tribble( ~genea, ~geneb, ~genec, ######################################## 110, 112, 114, 6, 5, 4 )

Tibbles are built to not overwhelm your console when printing data, only showing

the first few lines.

This is how a data frame prints

print(by_day) as.data.frame(by_day) head(by_day)

nycflights13::flights %>% print(n=10, width=Inf)

subsetting

subsetting tibbles is easy, similar to data.frames

df_tibble <- tibble(nycflights13::flights)

df_tibble

we can subset by column name using the $

df_tibble$carrier

we can subset by position using [[]]

df_tibble[[2]]

if you want to use this in a pipe, you need to use the “.” placeholder

df_tibble %>% .$carrier

some older functions do not like tibbles, thus you might have to convert them back to dataframe

class(df_tibble)

df_tibble_2 <- as.data.frame(df_tibble)

class(df_tibble_2)

df_tibble

head(df_tibble_2)

tidyr

library(tidyverse)

How do we make a tidy data set? well the tidyverse follows three rules

1 - each variable must have its own column. 2 - each observation has its own row. 3 - each value has its own cell.

It is impossible to satisfy two of the three rules

this leads to the following instructions for tidy data

1 put each data set into a tibble 2 put each variable into a column 3 profit

picking one consistent method of data storage makes for easier understanding of your code and what is happening “under the hood” or behind the scenes.

lets now look at working with tibbles

bmi <- tibble (women)

bmi %>% mutate(bmi = (703 * weight)/(height)^2)

spreading and gathering

sometimes you’ll find datasets that don’t fit well into a tibble

we’ll use the built-in data from tidyverse for this part

table4a

as you can see from this data, we have one variabel in column A (country) # but column b and c are two of the same. Thus, there are two observations in # each row.

To fix this, we can use the gather function

table4a %>% gather(‘1999’, ‘2000’, key = ‘year’, value = ‘cases’)

lets look at another example

table4b

as you can see we have the same problem in table 4b

table4b %>% gather(“1999”, “2000”, key = “year”, value = “population”)

Now what if we want to join these two tables? we can use dplyr

table4a <- table4a %>% gather(‘1999’, ‘2000’, key = ‘year’, value = ‘cases’) table4b <- table4b %>% gather(“1999”, “2000”, key = “year”, value = “population”)

left_join(table4a, table4b)

Spreading

Spreading is is the opposite of gathering. Lets look at table 2

table2

you can see that we have redundant info in columns 1 and 2

we can fix that by combining rows 1&2, 3&4 etc.

spread(table2, key = type, value = count)

Type is the key of what we are turning into columns the value is what becomes rows

In summary, spread makes long tables shorter and wider

gather makes wide tables, narrower and longer.

separating and pull

now what happens when we have two observations stuck in one column?

table3

as you can see, the rate is just the population and cases combined.

we can use separate to fix this

table3 %>% separate(rate, into = c(“cases”, “population”))

however, if you notice, the column type is not correct.

table3 %>% separate(rate, into = c(“cases”, “populate”), conver = TRUE)

you can specify what you want to separate based on.

table3 %>% separate(rate, into = c(“cases”, “populate”), sep = “/”, conver = TRUE)

Lets make this look more tidy

table3 %>% separate( year, into = c(“century”, “year”), convert = TRUE, sep = 2 )

Unite

what happens if we want to do the inverse of separate?

table5

table5 %>% unite(date, century, year)

table5 %>% unite(date, century, year, sep = ““)

Missing values

There can be two types of missing vales. NA (explicit) or just no entry(implicit)

gene_data <- tibble( gene = c(‘a’, ‘a’, ‘a’, ‘a’, ‘b’, ‘b’, ‘b’), nuc = c(20, 22, 24, 25, NA, 42, 67), run = c(1,2,3,4,2,3,4) )

gene_data

The nucleotide count for Gene b run 2 is explicit missing.

The nucleotide count for gene b run 1 is implicitly missing.

one way we can make implicit missing values explicit is by putting runs in columns.

gene_data %>% spread(gene, nuc)

if we want to remove the missing values, we can use spread and gather, and na.rm = TRUE

gene_data %>% spread(gene, nuc) %>% gather(gene, nuc, ‘a’:‘b’, na.rm = TRUE)

Another way that we can make implicit values explicit is complete()

gene_data %>% complete(gene, run)

Sometimes an NA is present to represent a value being carried forward

treatment <-tribble( ~person, ~treatment, ~response, ############################################# “Issac”, 1, 7,
NA, 2, 10, NA, 3, 9, “VDB”, 1, 8, NA, 2, 11, NA, 3, 10, )

treatment

what we can do here is use the fill() option

treatment %>% fill(person)

DPLYR

It is rare that you will be working with a single data table. The DPLYR package allows you to join teo data tables based on common values

mutate joins- add new variable to one data frame from the matching observations in another

Filtering joins- filters observation from one data frame based on whether or not

    # they are present in another

set operations- treats observations as they are set elements

library(tidyverse) library(nycflights13)

lets pull you full carrier names based on letter codes

airlines

lets get info about airports

airports

lets get info about each plane

planes

lets get some info on the weather at airports

weather

lets get info on singular flights

flights

lets look at how these tables connect

flights -> planes based on tail number

flights -> airlines through carrier

flights -> airports origin and destination

flights -> weather via origin, year/monthday/hour

keys

keys are uniwue identifiers per observation primary key uniquely identifies an observation in its own table.

one way to identify a primary key is as follows

planes %>% count(tailnum) %>% filter(n>1)

this indicated that the tail number is unique

planes %>% count(model) %>% filter(n>1)

mutate join

flights2 <- flights %>% select(year:day, hour, origin, dest, tailnum, carrier)

flights2

flights2 %>% select(-origin, -dest) %>% left_join(airlines, by = ‘carrier’)

we’ve now added the airline name to your data frame from the airline data frame

other types of joins

inner joins (inner_join) matches a pair of observations when their key is equal

outer joins (outer_join) keeps observations that appear in atleast one table

stringr

library(tidyverse) library(stringr)

you can create strings using single or double quotes

string1 <- “this is a string” string2 <- “to put a quote in your string, use the opposite”

string1 string2

if you forget to close your string, you’ll get this:

string3 <- “where is this string going?”

string3

just hit escape and try again

multiple strings are stored in character vectors

string4 <- c(“one”, “two”, “three”) string4

measuring string length

str_length(string3)

tr_length(string4)

lets combine two strings

str_c(“X”, “Y”)

str_c(string1, string2)

you can use sep to control how they are separated

str_c(string1, string2, sep = ” “)

str_c(“x”, “y”, “z”, sep = “_“)

subsetting strings

you can subset a string using str_sub()

HSP <- c(“HSP123”, “HSP234”, “HSP456”)

str_sub(HSP, 4,6)

this just drop the first four letters from the strings

or you can use negative to count back from the end

str_sub(HSP, -3, -1)

you can convert the cases of strings like follows

HSP str_to_lower(HSP)

str_to_upper()

regular expression

install.packages(“htmlwidgets”)

x <- c(‘ATTAGA’, ‘CGCCCCCGGAT’, ‘TATTA’)

str_view(x, “G”)

str_view(x, “TA”)

the next step is “.” where the “.” matches an entry

str_view(x, “.G.”)

anchors allow you to mactch at the start or the ending

str_view(x, “^TA”)

str_view(x, “TA$”)

character classes/alternatives

you can also use | to pick between two alternatives str_view(x, “TA[G|T]”)

Detect Matches

str_detect() returns a logical vextor the same length of input

y <- c(“apple”, “banana”, “pear”) y

str_detect(y, “e”)

How many commin words contain letter e

words

sum(str_detect(words, “e”))

lets get more complex, what proption words in a vowel

mean(str_detect(words, “[aeiou]$”))

mean (str_detect(words, “¹”))

lets find all the words that don’t contain “o” or “u”

no_o <- !str_detect(words, “[ou]”)

no_o

now lets extract

words[!str_detect(words, “[ou]”)]

you can also use str_count() to day how many matches there are in string

str_count(x, “[GC]”)

lets couple this with mutate

df <- tibble( word = words, count = seq_along(word) )

df %>% mutate( vowels = str_count(words, “[aeiou]”), constonants = str_count(words, “[^aeiou]”) )

Microarrays

Limma Microarray Analysis

setwd(“~/Desktop/classroom/myfiles”) #setwd(“~/Desktop/classroom/myfiles”)

library(ath1121501cdf) library(ath1121501.db) library(annotate) library(affy) library(limma) library(oligo) library(dplyr) library(stats) library(reshape)

Read the cell files into the directory

targets <- readTargets(“Bric16_Targets.csv”, sep = “,”, row.names = “filenames”)

ab <- ReadAffy() ########################################## hist(ab)

Normalizing the microarray experiment

eset <- affy::rma(ab)

pData(eset)

Lets visualize the normalize

hist(eset)

lets characterize the data a bit

ID <- featureNames(eset) Symbol <- getSYMBOL(ID, “ath1121501.db”)

affydata <- read.delim(“affy_ATH1_array_elements.txt”)

Differential gene expression analysis

Flight vs Ground

condition <- targets$gravity

design <- model.matrix(~factor(condition)) colnames(design) <- c(“flight”, “ground”)

fit <- lmFit(eset, design) fit <- eBayes(fit) options(digits = 2) top <- topTable(fit,coef = 2, n = Inf, adjust = ‘fdr’)

Lets combine annotations

Annot <- data.frame(GENE = sapply(contents(ath1121501GENENAME), paste, collapse = “,”), KEGG = sapply(contents(ath1121501PATH), paste, collapse = “,”), GO = sapply(contents(ath1121501GO), paste, collapse = “,”), SYMBOL = sapply(contents(ath1121501SYMBOL), paste, collapse = “,”), ACCNUM = sapply(contents(ath1121501ACCNUM), paste, collapse = “,”))

lets merge all the data into one data frame

all <- merge(Annot, top, by.x = 0, by.y = 0, all = TRUE)

all2 <- merge(all, affydata, by.x = “Row.names”, by.y = “array_element_name”)

lets convert the ACCNUM to a character value

all2$ACCNUM <- as.character(all2$ACCNUM)

write.table(all2, file = “BRIC_16_Final.csv”, row.names = TRUE, col.names = TRUE, sep = “)

columns(org.At.tair.db)

foldchanges <- as.data.frame(all2$logFc)

all2$entrez = mapIds(org.At.tair.db, keys = all2$ACCNUM, column = “ENTREZID”, keytype = “TAIR”, multiVals = “first”)

head(all2, 10)

Pathview

library(pathview) library(gage) library(gageData) data(kegg.sets.hs)

foldchanges = all2$logFC names (foldchanges) = all2$entrez

head(foldchanges)

kegg.ath = kegg.gsets(“ath”, id.type = “entrez”) kegg.ath.sigmet = kegg.ath$kg.sets[kegg.ath$sigmet.idx]

lets get the results

keggres = gage(foldchanges, gsets = kegg.ath.sigmet, same.dir = TRUE)

lapply(keggres, head)

greater <- keggres$greater lessers <- keggres$less

write.csv(greater, file = “BRIC_16_pathview_Greater_Pathways.csv”) write.csv(lessers, file = “BRIC_16_pathview_lesser_Pathways.csv”)

lets map to pathways (greater)

keggrespathways = data.frame(id=rownames(keggres$greater), keggres$greater) %>% tbl_df() %>% filter(row_number()<=5) %>% .$id %>% as.character() keggrespathways

keggressids_greater = substr(keggrespathways, start=1, stop=8) keggressids_greater

genedata <- as.character(all2$logFC)

define a plotting function to apply next

plot_pathway = function(pid) pathview(gene.data=foldchanges, pathway.id = pid, species = “ath”, new.signature = FALSE)

plot multiple pathways (plots are saved to disk and returns a throwaway object list)

tmp = sapply(keggressids_greater, function(pid) pathview(gene.data=foldchanges, pathway.id = pid, species = “ath”))

lets map pathways (lesser)

keggrespathways = data.frame(id=rownames(keggres$less), keggres$less) %>% tbl_df() %>% filter(row_number()<=5) %>% .$id %>% as.character() keggrespathways

keggressids_less = substr(keggrespathways, start=1, stop=8) keggressids_less

genedata <- as.character(all2$logFC)

define a plotting function to apply next

plot_pathway = function(pid) pathview(gene.data=foldchanges, pathway.id = pid, species = “ath”, new.signature = FALSE)

plot multiple pathways (plots are saved to disk and returns a throwaway object list)

tmp = sapply(keggressids_less, function(pid) pathview(gene.data=foldchanges, pathway.id = pid, species = “ath”))

author: Lauren Green date: 5/20/2022 output: html_document —

Lets load the required libraries for this analysis

library("DESeq2")

## Loading required package: S4Vectors

## Loading required package: stats4

## Loading required package: BiocGenerics

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs

## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
##     union, unique, unsplit, which.max, which.min

## 
## Attaching package: 'S4Vectors'

## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname

## Loading required package: IRanges

## Loading required package: GenomicRanges

## Loading required package: GenomeInfoDb

## Loading required package: SummarizedExperiment

## Loading required package: MatrixGenerics

## Loading required package: matrixStats

## 
## Attaching package: 'MatrixGenerics'

## The following objects are masked from 'package:matrixStats':
## 
##     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
##     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
##     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
##     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
##     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
##     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
##     colWeightedMeans, colWeightedMedians, colWeightedSds,
##     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
##     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
##     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
##     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
##     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
##     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
##     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
##     rowWeightedSds, rowWeightedVars

## Loading required package: Biobase

## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.

## 
## Attaching package: 'Biobase'

## The following object is masked from 'package:MatrixGenerics':
## 
##     rowMedians

## The following objects are masked from 'package:matrixStats':
## 
##     anyMissing, rowMedians

library("dplyr")

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:Biobase':
## 
##     combine

## The following object is masked from 'package:matrixStats':
## 
##     count

## The following objects are masked from 'package:GenomicRanges':
## 
##     intersect, setdiff, union

## The following object is masked from 'package:GenomeInfoDb':
## 
##     intersect

## The following objects are masked from 'package:IRanges':
## 
##     collapse, desc, intersect, setdiff, slice, union

## The following objects are masked from 'package:S4Vectors':
## 
##     first, intersect, rename, setdiff, setequal, union

## The following objects are masked from 'package:BiocGenerics':
## 
##     combine, intersect, setdiff, union

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library("apeglm")

lets load in our data

countData <- read.csv("GLDS-102_rna_seq_Unnormalized_Counts.csv")

## Warning in file(file, "rt"): cannot open file
## 'GLDS-102_rna_seq_Unnormalized_Counts.csv': No such file or directory

## Error in file(file, "rt"): cannot open the connection

colData <- read.csv("PHENO_DATA_Mouse.csv", sep = ",", row.names = 1)

## Warning in file(file, "rt"): cannot open file 'PHENO_DATA_Mouse.csv': No such
## file or directory

## Error in file(file, "rt"): cannot open the connection

Now we need to add leveling factors to the colData

colData$group <- factor(colData$group, levels = c("0", "1"))

## Error in colData$group: object of type 'closure' is not subsettable

Now lets make sure we have the same number of individuals as well as groups

all(rownames(colData)) %in% colnames(countData)

## Error in h(simpleError(msg, call)): error in evaluating the argument 'table' in selecting a method for function '%in%': error in evaluating the argument 'x' in selecting a method for function 'colnames': object 'countData' not found

we need to round up the counts, because DESeq2 only allows integers as an input, and not fractional numbers. This depends on the method of alignment that was used upstream.

countData %>%
  mutate_if(is.numeric, ceiling)

## Error in is_grouped_df(tbl): object 'countData' not found

countData[, 2:13] <- sapply(countData[, 2:13], as.integer)

## Error in h(simpleError(msg, call)): error in evaluating the argument 'X' in selecting a method for function 'sapply': object 'countData' not found

row.names(countData) <- countData [, 1]

## Error in eval(expr, envir, enclos): object 'countData' not found

countData <- countData[2:13]

## Error in eval(expr, envir, enclos): object 'countData' not found

rownames(colData) == colnames(countData)

## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'colnames': object 'countData' not found

dds <- DESeqDataSetFromMatrix(countData = countData, colData = colData, design = ~group)

## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'ncol': object 'countData' not found

dds <- dds[rowSums(counts(dds)>2) >=4]

## Error in eval(expr, envir, enclos): object 'dds' not found

dds <- DESeq(dds)

## Error in is(object, "DESeqDataSet"): object 'dds' not found

res <- results(dds)

## Error in is(object, "DESeqDataSet"): object 'dds' not found

resLFC <- lfcShrink(dds, coef= 2)

## Error in is(dds, "DESeqDataSet"): object 'dds' not found

(resOrdered <- res[order(res$padj), ])

## Error in eval(expr, envir, enclos): object 'res' not found

summary(res)

## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'summary': object 'res' not found

FLT_VS_GC <- as.data.frame(res$log2FoldChange)

## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': object 'res' not found

head(FLT_VS_GC)

## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'head': object 'FLT_VS_GC' not found

plotMA(resLFC, ylim = c(-2,2))

## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'plotMA': object 'resLFC' not found

pdf(file = "MA_Plot_FLT_vs_GC.pdf")

dev.off()

## png 
##   2

Lets save our differential expression results to file.

write.csv(as.data.frame(resOrdered), file = "Mouse_DESeq.csv")

## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': object 'resOrdered' not found

Lets perform a custom transformation

dds <- estimateSizeFactors(dds)

## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'estimateSizeFactors': object 'dds' not found

se <- SummarizedExperiment(log2(counts(dds, normalize = TRUE) +1), colData = colData(dds))

## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'counts': object 'dds' not found

pdf(file = "PCA_PLOT_FLT_vs_GC.pdf")

plotPCA(DESeqTransform(se), intgroup = "group")

## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'plotPCA': object 'se' not found

lets load our required packages

library(AnnotationDbi)

## 
## Attaching package: 'AnnotationDbi'

## The following object is masked from 'package:dplyr':
## 
##     select

library(org.Mm.eg.db)

##

columns(org.Mm.eg.db)

##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
## [11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MGI"         
## [16] "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
## [21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UNIPROT"

foldchanges <- as.data.frame(res$log2FoldChange, row.names = row.names(res))

## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': object 'res' not found

head(foldchanges)

## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'head': object 'foldchanges' not found

res$symbol = mapIds(org.Mm.eg.db,
                    keys = row.names(res),
                    column = "SYMBOL",
                    keytype = "ENSEMBL",
                    multiVals = "first")

## Error in row.names(res): object 'res' not found

res$entrez = mapIds(org.Mm.eg.db,
                    keys = row.names(res),
                    column = "ENTREZID",
                    keytype = "ENSEMBL",
                    multiVals = "first")

## Error in row.names(res): object 'res' not found

res$name = mapIds(org.Mm.eg.db,
                    keys = row.names(res),
                    column = "GENENAME",
                    keytype = "ENSEMBL",
                    multiVals = "first")

## Error in row.names(res): object 'res' not found

head(res, 10)

## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'head': object 'res' not found

Lets load our pathview packages

library(pathview)

##

## ##############################################################################
## Pathview is an open source software package distributed under GNU General
## Public License version 3 (GPLv3). Details of GPLv3 is available at
## http://www.gnu.org/licenses/gpl-3.0.html. Particullary, users are required to
## formally cite the original Pathview paper (not just mention it) in publications
## or products. For details, do citation("pathview") within R.
## 
## The pathview downloads and uses KEGG data. Non-academic uses may require a KEGG
## license agreement (details at http://www.kegg.jp/kegg/legal.html).
## ##############################################################################

library(gage)

##

library(gageData)
data(kegg.sets.mm)
data(go.subs.mm)

foldchanges <- res$log2FoldChange

## Error in eval(expr, envir, enclos): object 'res' not found

names(foldchanges) = res$entrez

## Error in eval(expr, envir, enclos): object 'res' not found

head(foldchanges)

## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'head': object 'foldchanges' not found

kegg.mm = kegg.gsets("mouse", id.type = "entrez")
kegg.mm.sigmet <- kegg.mm$kg.sets[kegg.mm$sigmet.idx]

lets map the resulta

keggres <- gage(foldchanges, gsets = kegg.mm.sigmet, same.dir = TRUE)

## Error in saaPrep(exprs, ref = ref, samp = samp, same.dir = same.dir, compare = compare, : object 'foldchanges' not found

lapply(keggres, head)

## Error in h(simpleError(msg, call)): error in evaluating the argument 'X' in selecting a method for function 'lapply': object 'keggres' not found

Lets save our greater and less than pathways

greaters <- keggres$greater

## Error in eval(expr, envir, enclos): object 'keggres' not found

lessers <- keggres$less

## Error in eval(expr, envir, enclos): object 'keggres' not found

keggrespathways <- data.frame(id = rownames(keggres$greater), keggres$greater) %>%
  tbl_df() %>%
  filter(row_number() <=3) %>%
  .$id %>%
  as.character()

## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## Please use `tibble::as_tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.

## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'rownames': object 'keggres' not found

keggrespathways

## Error in eval(expr, envir, enclos): object 'keggrespathways' not found

keggresids <- substr(keggrespathways, start = 1, stop = 8)

## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'substr': object 'keggrespathways' not found

keggresids

## Error in eval(expr, envir, enclos): object 'keggresids' not found

PLOT!

plot_pathway = function(pid) pathview(gene.data = foldchange, pathway.id = pid, species = "mouse", new.signature = FALSE)

tmp = sapply(keggresids, function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "mouse"))

## Error in h(simpleError(msg, call)): error in evaluating the argument 'X' in selecting a method for function 'sapply': object 'keggresids' not found

keggrespathways <- data.frame(id = rownames(keggres$less), keggres$less) %>%
  tbl_df() %>%
  filter(row_number() <=3) %>%
  .$id %>%
  as.character()

## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'rownames': object 'keggres' not found

keggrespathways

## Error in eval(expr, envir, enclos): object 'keggrespathways' not found

keggresids <- substr(keggrespathways, start = 1, stop = 8)

## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'substr': object 'keggrespathways' not found

keggresids

## Error in eval(expr, envir, enclos): object 'keggresids' not found

PLOT!

plot_pathway = function(pid) pathview(gene.data = foldchange, pathway.id = pid, species = "mouse", new.signature = FALSE)

tmp = sapply(keggresids, function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "mouse"))

## Error in h(simpleError(msg, call)): error in evaluating the argument 'X' in selecting a method for function 'sapply': object 'keggresids' not found

library(imager)

## Loading required package: magrittr

## 
## Attaching package: 'imager'

## The following object is masked from 'package:magrittr':
## 
##     add

## The following objects are masked from 'package:SummarizedExperiment':
## 
##     resize, width

## The following object is masked from 'package:Biobase':
## 
##     channel

## The following objects are masked from 'package:GenomicRanges':
## 
##     resize, width

## The following objects are masked from 'package:IRanges':
## 
##     resize, width

## The following object is masked from 'package:S4Vectors':
## 
##     width

## The following object is masked from 'package:BiocGenerics':
## 
##     width

## The following objects are masked from 'package:stats':
## 
##     convolve, spectrum

## The following object is masked from 'package:graphics':
## 
##     frame

## The following object is masked from 'package:base':
## 
##     save.image

file.names <- list.files(path = "~desktop/classroom/myfiles", pattern = ".*pathview.png")

all_images <- lapply(filenames, load.image)

## Error in h(simpleError(msg, call)): error in evaluating the argument 'X' in selecting a method for function 'lapply': object 'filenames' not found

knitr::include_graphics(filenames)

## Error in Encoding(x): object 'filenames' not found

author: Lauren Green date: “5/20/2022” output: html_document —

library("edgeR")

## Loading required package: limma

## 
## Attaching package: 'limma'

## The following object is masked from 'package:DESeq2':
## 
##     plotMA

## The following object is masked from 'package:BiocGenerics':
## 
##     plotMA

library("dplyr")
library("AnnotationDbi")
library("org.Mm.eg.db")

rawdata = read.csv("GLDS-102_rna_seq_Normalized_Counts.csv", row.names = "gene_id")

## Warning in file(file, "rt"): cannot open file
## 'GLDS-102_rna_seq_Normalized_Counts.csv': No such file or directory

## Error in file(file, "rt"): cannot open the connection

group <- factor(c("1", '1', '1', '1', '1', '1', '2', '2', '2', '2', '2', '2'))

dgeGlm <- DGEList(counts = rawdata, group = group)

## Error in is.data.frame(counts): object 'rawdata' not found

str(dgeGlm)

## Error in str(dgeGlm): object 'dgeGlm' not found

keep <- rowSums(cpm(dgeGlm)>2) >=4

## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'rowSums': object 'dgeGlm' not found

dgeGlm <- dgeGlm[keep,]

## Error in eval(expr, envir, enclos): object 'dgeGlm' not found

names(dgeGlm)

## Error in eval(expr, envir, enclos): object 'dgeGlm' not found

dgeGlm[["samples"]]

## Error in eval(expr, envir, enclos): object 'dgeGlm' not found

design <- model.matrix(~group)
design

##    (Intercept) group2
## 1            1      0
## 2            1      0
## 3            1      0
## 4            1      0
## 5            1      0
## 6            1      0
## 7            1      1
## 8            1      1
## 9            1      1
## 10           1      1
## 11           1      1
## 12           1      1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"

dgeGlmComDisp <- estimateGLMCommonDisp(dgeGlm, design, verbose = TRUE)

## Error in estimateGLMCommonDisp(dgeGlm, design, verbose = TRUE): object 'dgeGlm' not found

dgeGlmTrendDisp <- estimateGLMTrendedDisp(dgeGlmComDisp, design)

## Error in estimateGLMTrendedDisp(dgeGlmComDisp, design): object 'dgeGlmComDisp' not found

dgeGlmTagDisp <- estimateGLMTagwiseDisp(dgeGlmTrendDisp, design)

## Error in estimateGLMTagwiseDisp(dgeGlmTrendDisp, design): object 'dgeGlmTrendDisp' not found

plotBCV(dgeGlmTagDisp)

## Error in is(y, "DGEList"): object 'dgeGlmTagDisp' not found

fit <- glmFit(dgeGlmTagDisp, design)

## Error in glmFit(dgeGlmTagDisp, design): object 'dgeGlmTagDisp' not found

colnames(coef(fit))

## NULL

lrt <- glmLRT(fit, coef =2)

## Error in glmLRT(fit, coef = 2): glmfit must be an DGEGLM object (usually produced by glmFit).

ttGlm <- topTags(lrt, n = Inf)

## Error in topTags(lrt, n = Inf): object 'lrt' not found

class(ttGlm)

## Error in eval(expr, envir, enclos): object 'ttGlm' not found

summary(deGlm <- decideTestsDGE(lrt, p = 0.1, adjust = "fdr"))

## Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'summary': object 'lrt' not found

tagsGlm <- rownames(dgeGlmTagDisp)[as.logical(deGlm)]

## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'rownames': object 'dgeGlmTagDisp' not found

plotSmear(lrt, de.tags = tagsGlm)

## Error in plotSmear(lrt, de.tags = tagsGlm): object 'lrt' not found

hits2 <- ttGlm$table[ttGlm$table$FDR < 0.1, ]

## Error in eval(expr, envir, enclos): object 'ttGlm' not found

write.csv(hits2, "mouse_edgeR_results_Zero_vs_1.csv")

## Error in is.data.frame(x): object 'hits2' not found

columns(org.Mm.eg.db)

##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
## [11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MGI"         
## [16] "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
## [21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UNIPROT"

ttGlm2 <- as.data.frame(ttGlm$table)

## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': object 'ttGlm' not found

ttGlm$symbol = mapIds(org.Mm.eg.db, 
                      keys=row.names(ttGlm2), 
                      column = "SYMBOL", 
                      keytype = "ENSEMBL", 
                      multiVals = "first")

## Error in row.names(ttGlm2): object 'ttGlm2' not found

ttGlm2$entrez = mapIds(org.Mm.eg.db, 
                         keys=row.names(ttGlm2), 
                      column = "ENTREZID", 
                      keytype = "ENSEMBL", 
                      multiVals = "first")

## Error in row.names(ttGlm2): object 'ttGlm2' not found

ttGlm2$name = mapIds(org.Mm.eg.db, 
                         keys=row.names(ttGlm2), 
                      column = "GENENAME", 
                      keytype = "ENSEMBL", 
                      multiVals = "first")

## Error in row.names(ttGlm2): object 'ttGlm2' not found

head(ttGlm2)

## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'head': object 'ttGlm2' not found

library(pathview)
library(gage)
library(gageData)
data(kegg.sets.mm)
data(go.subs.mm)

foldchanges <- ttGlm2$logFC

## Error in eval(expr, envir, enclos): object 'ttGlm2' not found

names(foldchanges) <- ttGlm2$entrez

## Error in eval(expr, envir, enclos): object 'ttGlm2' not found

kegg.mm = kegg.gsets("mouse", id.type = "entrez")
kegg.mm.sigmet = kegg.mm$kg.sets [kegg.mm$sigmet.idx]

# lets get the results
keggres = gage(foldchanges, gsets = kegg.mm.sigmet, smae.dir = TRUE)

## Error in saaPrep(exprs, ref = ref, samp = samp, same.dir = same.dir, compare = compare, : object 'foldchanges' not found

lapply(keggres, head)

## Error in h(simpleError(msg, call)): error in evaluating the argument 'X' in selecting a method for function 'lapply': object 'keggres' not found

greaters <- keggres$greater

## Error in eval(expr, envir, enclos): object 'keggres' not found

lessers <- keggres$lessers

## Error in eval(expr, envir, enclos): object 'keggres' not found

{

keggrespathways = data.frame(id = rownames(keggres$greater), keggres$greater) %>%
  tbl_df() %>%
  filter(row_number()<=5) %>%
  .$id %>%
  as.character()

## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'rownames': object 'keggres' not found

keggrespathways

## Error in eval(expr, envir, enclos): object 'keggrespathways' not found

keggresids = substr(keggrespathways, start=1, stop = 8)

## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'substr': object 'keggrespathways' not found

keggresids

## Error in eval(expr, envir, enclos): object 'keggresids' not found

plot_pathway = function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "mouse", new.signature = FALSE)

# plot multiple pathways
tmp = sapply(keggresids, function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "mouse"))

## Error in h(simpleError(msg, call)): error in evaluating the argument 'X' in selecting a method for function 'sapply': object 'keggresids' not found

keggrespathways = data.frame(id = rownames(keggres$less), keggres$less) %>%
  tbl_df() %>%
  filter(row_number()<=5) %>%
  .$id %>%
  as.character()

## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'rownames': object 'keggres' not found

keggrespathways

## Error in eval(expr, envir, enclos): object 'keggrespathways' not found

keggresids = substr(keggrespathways, start=1, stop = 8)

## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'substr': object 'keggrespathways' not found

keggresids

## Error in eval(expr, envir, enclos): object 'keggresids' not found

plot_pathway = function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "mouse", new.signature = FALSE)

# plot multiple pathways
tmp = sapply(keggresids, function(pid) pathview(gene.data = foldchanges, pathway.id = pid, species = "mouse"))

## Error in h(simpleError(msg, call)): error in evaluating the argument 'X' in selecting a method for function 'sapply': object 'keggresids' not found

library(imager)


filenames <- list.files(path = "~/Desktop/classroom/myfiles", pattern= ".*pathview.png")

all_images <- lapply(filenames, load.image)

## Error in wrap.url(file, load.image.internal): File not found

knitr::include_graphics(filenames)

## Error in knitr::include_graphics(filenames): Cannot find the file(s): "ath00020.pathview.png"; "ath00071.pathview.png"; "ath00195.pathview.png"; "ath00592.pathview.png"; "ath00906.pathview.png"; "ath01230.pathview.png"; "ath03010.pathview.png"; "ath04016.pathview.png"; "ath04141.pathview.png"; "ath04145.pathview.png"; "mmu00910.pathview.png"; "mmu00970.pathview.png"; "mmu02010.pathview.png"; "mmu03008.pathview.png"; "mmu03010.pathview.png"; "mmu03013.pathview.png"; "mmu04022.pathview.png"; "mmu04110.pathview.png"; "mmu04145.pathview.png"; "mmu04217.pathview.png"; "mmu04330.pathview.png"; "mmu04350.pathview.png"; "mmu04360.pathview.png"; "mmu04390.pathview.png"; "mmu04550.pathview.png"; "mmu04657.pathview.png"; "mmu04714.pathview.png"

library(tidyverse)

EdgeR <- read.csv(“mouse_edgeR_results_Zero_vs_1.csv”)

DESeq <- read.csv(“mouse_DESeq.csv”)

DESeq2 <- DESeq %>% filter(padj <0.1)

DESeq2 <- DESeq2[,c(1,3)]

EdgeR<-EdgeR[, 1:2]

colnames(DESeq2) <- c(“ID”, “logFC”) colnames(EdgeR) <- c(“ID”, “logFC”)

install.packages(“GOplot”) library(GOplot)

comp <- GOVenn(DESeq2, EdgeR, label = c(“DESeq1”, “EdgeR”), title = “Comparison of DESeq and EdgeR DE Genes”, plot = FALSE)

comp$plot

comp$table

library(msa)

seq <- readAAStringSet(“hglobin.fa”) # seq <- readAAStringSet(“hglobin.fa”)

seq

lets align the 8 different amino acid sequences

alignments <- msa(seq, method = “ClustalW”) alignments

msaPrettyPrint(alignments, output = “pdf”, showNames = “left”, showLogo = “none”, askForOverwrite = FALSE, verbose = TRUE, file = “whole_align.pdf”)

msaPrettyPrint(alignments, c(10,30), output = “pdf”, showNames = “left”, file = “Zoomed_align.pdf”, showLogo = “top”, askForOverwrite = FALSE, verbose = TRUE)

Lets make a tree from our alignment

library(ape) library(seqinr)

convert to seqinr alignment _> get the distances and make a tree

alignment_seqninr <- msaConvert(alignments, type = “seqinr::alignment”)

distances1 <- seqinr::dist.alignment(alignment_seqninr, “identity”)

tree <- ape::nj(distances1) plot(tree, main = “Phylogenetic Tree of HBA Sequences”)

library(DECIPHER)

In the first step, we load the libraries and the sequence into long_seqs

This is a DNAStringSet object

~Desktop/classroom/myfiles

long_seg <- readDNAStringSet(“plastid_genomes.fa”) long_seg

Now lets build a temporary SQLite database

Seqs2DB(long_seq, “XStringSet”, “long_db”, names(long_seq))

Now that we’ve built a database, we can do the following:

find the syntenic blocks

synteny <- FindSynteny(“long_db”)

#view blocs with plotting pairs(synteny) plot(synteny)

make an actual alignment file

alignment <- AlignSynteny(synteny, “long_db”)

lets create a structure holding all aligned syntenic blocks for a pair of sequences

block <- unlist(alignment[[1]])

we can write to file one alignment at a time

writeXStringSet(block, “genome_blocks_out.fa”)

library(DECIPHER)

In the first step, we load the libraries and the sequence into long_seqs

This is a DNAStringSet object

~Desktop/classroom/myfiles

long_seg <- readDNAStringSet(“plastid_genomes.fa”) long_seg

Now lets build a temporary SQLite database

Seqs2DB(long_seq, “XStringSet”, “long_db”, names(long_seq))

Now that we’ve built a database, we can do the following:

find the syntenic blocks

synteny <- FindSynteny(“long_db”)

#view blocs with plotting pairs(synteny) plot(synteny)

make an actual alignment file

alignment <- AlignSynteny(synteny, “long_db”)

lets create a structure holding all aligned syntenic blocks for a pair of sequences

block <- unlist(alignment[[1]])

we can write to file one alignment at a time

writeXStringSet(block, “genome_blocks_out.fa”)

#Quantifying difference between trees with treespace

First lets load the required packages

library(ape) library(adegraphics) library(treespace)

Now we need to load all the treefiles into a multiPhylo object

treefiles <- list.files(file.path(getwd(), “gene_trees”), full.names = TRUE) tree_list <- lapply(treefiles, read.tree) class(tree_list)

class(tree_list) <- “multiPhylo”

class(tree_list)

# Now we can compute the Kendall-Coljin distances between tress. This function # does a LOT of analyses.

# First it runs a pairwise comparison of all trees in the input # Second it carrier out clustering using PCA # These results are returned in a list of objects, where $D contains the pairwise # metric of the trees, and $pco contains the PCA. The method we use (Kendal-Coljin) # particularly sutiable for rooted trees as we have here. The option NF tells us how # many principal components to retain.

comparisons <- treespace(tree_list, nf = 3)

aeiou↩︎

Final Project

Lauren Green

4/11/2022

Section Headers

First level header

Second level header

Third level header

Block Quotes

Hyperlinks

Formulas

Code Chunks

Code chunk options

Table of Contents

Tabs

Themes

Code Folding

Summary

Data Wrangling with R

Arrange

select

renaming

mutate

summarize and by_group()

So we can see here that the average delay is about 12 minutes

we gain additional value in summarize by pairing it with by_group()

Missing Data

counts

Piping

Tibbles

Now we will take the time to explore tibbles. Tibbles are modified dataframes which tweak some

of the older features from data frames. R is an old language and useful things from 20 years

ago are not as useful anymore

As we can see, we have the same data frame, but we have different features

you can also create a tibble form scratch with tibble(

you can also use tribble() for basic data table creation

Tibbles are built to not overwhelm your console when printing data, only showing

the first few lines.

This is how a data frame prints

subsetting

we can subset by column name using the $

we can subset by position using [[]]

if you want to use this in a pipe, you need to use the “.” placeholder

some older functions do not like tibbles, thus you might have to convert them back to dataframe

tidyr

spreading and gathering

To fix this, we can use the gather function

lets look at another example

as you can see we have the same problem in table 4b

Now what if we want to join these two tables? we can use dplyr

Spreading

Spreading is is the opposite of gathering. Lets look at table 2

you can see that we have redundant info in columns 1 and 2

we can fix that by combining rows 1&2, 3&4 etc.

Type is the key of what we are turning into columns the value is what becomes rows

In summary, spread makes long tables shorter and wider

gather makes wide tables, narrower and longer.

separating and pull

now what happens when we have two observations stuck in one column?

as you can see, the rate is just the population and cases combined.

we can use separate to fix this

however, if you notice, the column type is not correct.

you can specify what you want to separate based on.

Lets make this look more tidy

Unite

Missing values

The nucleotide count for Gene b run 2 is explicit missing.

The nucleotide count for gene b run 1 is implicitly missing.

one way we can make implicit missing values explicit is by putting runs in columns.

if we want to remove the missing values, we can use spread and gather, and na.rm = TRUE

Another way that we can make implicit values explicit is complete()

Sometimes an NA is present to represent a value being carried forward

what we can do here is use the fill() option

DPLYR

mutate joins- add new variable to one data frame from the matching observations in another

Filtering joins- filters observation from one data frame based on whether or not

set operations- treats observations as they are set elements

lets pull you full carrier names based on letter codes

lets get info about airports

lets get info about each plane

lets get some info on the weather at airports