R Final Project

INTRODUCTION TO BIOINFORMATICS WITH R

Simple Opperations

This is how R does addition {r} 20+6

This is how R does subtraction {r} 20-6

This is how we store data as variables. we are going to store the days of the week for this currewnt example: {r} days<- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")

Lets find the fifth entry {r} days[5]

Lets pull out a range of entries {r} days [1:3]

Lets pull out the end of the week {r} days [5:7]

Lets pull out specific days {r} days [c(1,3,5,7)]

Lets subset our “days” variable and create weekdays {r} weekdays <- days[1:5] print(weekdays)

Functions

f(argument1, argument2 …) where f is the name of the function, and argument1, argument2,… are the different, condtions thatg we are asking the function to evaluate

exampleFunction <- function(x,y) {r} {c(x+1, y+10)} exampleFunction(2,4)

Example of built in functions {r} exp(2)

tangents {r} tan(2)

logs ```{r} log(12)

log(x=12, base= 4)

log(12,4)


Data Structures
Lets start with the array function
```{r}
months <- array (c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"), dim=c(3,4))

months

and compare it to the simple list function ```{r} months1 <- c(“Jan”, “Feb”, “Mar”, “Apr”, “May”, “Jun”, “Jul”, “Aug”, “Sep”, “Oct”, “Nov”, “Dec”)

months1

months[2,3]


Lets now look at a matrix
```{r}
 months2 <- matrix(data=c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"), nrow = 3, ncol = 4)
 
 months2

making a 3d array ```{r} array3d <-array(c(2,3,4,5,10,12,14,16,18,20,22,24,26,28,30,32,34,36), dim = c(3,3,2))

print(array3d)

array3d[1,3,2]


 If you  want to pull an entire row or column
```{r}
 array3d[2, ,1]

List and Date Frames ```{r} HSPA1A <- list(gene=“HSPa1a”, amino.acids = 641, nucleotides = 2400)

print(HSPA1A)


 Lets pull out the amino acids
```{r}
 HSPA1A$amino.acids

Lets pull out the nucleotides {r} HSPA1A$nucleotides

Lets create three lists then combine them into a data frame Lets pull out just the genes from the data frames {r} HSPs$gene Lets pull out the nucleotides {r} HSPs$nucleotides

Lets search for a specific amino acid count {r} HSPs$aminoAcids[HSPs$gene == "HSPA8"]

Object Classes ```{r} print(gene) print(HSPs)

class(HSPs$gene)

HSPs$gene

class(HSPs$nucleotides) HSPs$nucleotides

class(HSPs$aminoAcids) HSPs$aminoAcids

x <- 15 +38

class(x)

z <- as.character(c(1,2,3,4,5,6))

class(z)

y <- as.character(c(9,8,7,6,5,4))

z + y

z2 <- c(1,2,3,4,5,6) y2 <- c(9,8,7,6,5,4)

z2 + y2

class(z2)

class(HSPs)

class(y)

y <- as.numeric(y)

class(y)


Models and Formulas
```{r}
y ~ x1 + x2

dataset <- iris

Lets look at the top few rows {r} head(dataset)

Lets look at the bottom few rows {r} tail(dataset)

Lets look at the total number of rows in the dataset

Lets look at the number of rows in the dataset nrow(dataset)

Lets look at the number of colums in the dataset ncol(dataset)

Lets start with a simple linear model ```{r} petals.lm <- lm(formula = Petal.Length ~ Petal.Width, data = dataset)

petals.lm

summary(petals.lm)


Charts and Graphs
```{r}
names(iris)

hist(iris$Sepal.Length)

Lets increase the number of bins in our histogram {r} hist(iris$Sepal.Length, breaks = 25)

Lets add some labels ```{r} hist(iris$Sepal.Length, breaks = 25, xlab = “Sepal Length”, main = “Sepal Length Frequency”)

plot(iris$Sepal.Length ~ iris$Sepal.Width, xlab = “Sepal Length”, ylab = “Sepal width”)

library(lattice)


Lets use the lattice dot plot
```{r}
dotplot(Sepal.Width ~ Sepal.Length|Species, data = iris)

Lets use the lattice dotplot to look at petal length vs width {r} dotplot(Petal.Length ~ Petal.Width|Species, data = dataset)

DATA WRANGLING

Lets upload the librart files ```{r} library(tidyverse)

??flights

my_data <- nycflights13::flights

head(my_data)


First we will just look at the data on october 14th.
```{r}
filter(my_data, month == 10, day ==14)

If we want to subset this into a new variable, we do the following: {r} oct_14_flight <- filter(my_data, month == 10, day ==14)

What if you want to do both print and save the variable? {r} (oct_14_flight_2 <- filter(my_data, month == 10, day ==14))

If you want to filter based on different opperators, you can use the following:

Equals == Not equal to != greater than > less than < greater than or equal to >= Less than or equal to <=

{r} (flight_through_september <- filter(my_data, month == 10, day ==14))

If we don’t use the == to mean equals, we get this: (oct_14_flight_2 <- filter(my_data, month = 10, day =14))

You can also use logical opperators to be more selective

and & or | not !

Lets use the “or” function to pick flights in march and april

```{r} March_April_Flights <- filter(my_data, month == 3 | month == 4)

March_4th_Flights <- filter(my_data, month == 3 & day ==4)

Non_jan_flights <- filter(my_data, month != 1)


 Arrange


 Arrange allows us to arrange the dataset based on the variables we desire.
```{r}
arrange(my_data, year, day, month)

we can also do this in descending fashion {r} descending <- arrange(my_data, desc(year), desc(day), desc(month))

Missing values are always placed at the end of the dataframe regardless of ascending or descending.

Select

We can also select specific columns that we want to look at. {r} calendar <- select(my_data, year, month, day) print(calendar)

We can also look at a range of columns {r} calendar2 <- select(my_data, year:day) Lets look at all columns months through carrier {r} calendar3 <- select(my_data, year:carrier)

We can also choose which columns NOT to include ```{r} everything_else <- select(my_data, -(year:day))

everything_else2 <- select(my_data, !(year:day))


There are also some other helper functions that can help you select the columns or data you're looking for

starts_with("xyz") -- will select the values that start with xyz
ends_with("xyz")  --- will select the values that end with xyz
contains("xyz")   --- will select the values that contain xyz
matches("xyz")   ---- will match the identical value xyz

 Renaming
```{r}
head(my_data)

rename(my_data, departure_time = dep_time)

my_data <- rename(my_data, departure_time = dep_time)

Mutate

What if you want to add new columns to your data frame? we have the mutate() function for that.

First, lets make smaller data frame so we can see what we are doing.

{r} my_data_small <- select(my_data, year:day, distance, air_time)

Lets calculate the speed of the flights. ```{r} mutate(my_data_small, speed = distance / air_time * 60)

my_data_small <- mutate(my_data_small, speed = distance / air_time * 60)


What if we wanted to create a new dataframe with ONLY your calculations (transmute)
```{r}
airspeed <- transmute(my_data_small, speed = distance / air_time * 60 , speed2 = distance / air_time)

Summarize and by_group()

we can use summarize to run a function on a data column to get a single return {r} summarize(my_data, delay = mean(dep_delay, na.rm = TRUE))

So we can see here that the average delay is about 12 minutes We gain additional value in summarize by pairing it with by_group() by_day <- group_by(my_data, year, month, day) {r} summarize(by_day, delay = mean(dep_delay, na.rm = TRUE))

As you can see, we now have the delay by the days of the year

Missing Data

What happens if we don’t tell R what to do with the missing data? {r} summarize(by_day, delay = mean(dep_delay,na.rm = TRUE)) we can also filter our data based on NA (which in this dataset was cancelled flights) {r} not_cancelled <- filter(my_data, !is.na(dep_delay), !is.na(arr_delay))

Lets run summarize again on this data {r} summarize(not_cancelled, dealy = mean(dep_delay))

counts

we can also count the number of variables that are NA {r} sum(is.na(my_data$dep_delay)) we can also count the numbers that are NOT NA {r} sum(!is.na(my_data$dep_delay))

Piping

with tibble datasets (more on them soon), we can pipe results to get rid of the need to use the dollar signs we can then summarize the number of flights by minutes delayed. {r} my_data %>% group_by(year, month, day)%>% summarize(mean = mean(departure_time, na.rm = TRUE))

Tibbles

{r} library(tibble) Now we will take the time to explore tibbles. Tibbles are modified dataframes which tweak some of the older features from data frames. R is an old language, and useful things from 20 years ago are not as useful anymore. {r} as_tibble(iris) As we can see, we have the same data frame, but we have different features

You can also create a tibble from scratch with tibble() {r} tibble( x = 1:5, y = 1, z = x ^ 2 + y ) You can also use tribble() for basic data table creation {r} tribble( ~genea, ~ genab, ~ genac, ######################### 110, 112, 114, 6, 5, 4 ) Tribbles are built to not overwhelm your console when printing data, only showing the first few lines.

This is how a data frame prints ```{r} print(by_day) as.data.frame(by_day) head(by_day)

nycflights13::flights %>% print(n=10, width = Inf)


Subsetting

Subsetting tribbles is easy, similar to data.frames
```{r}
df_tibble <- tibble(nycflights13::flights)

df_tibble

We can subset by column name using the $ {r} df_tibble$carrier We can subset by position using [[]] {r} df_tibble[[2]]

If you want to use this in a pipe, you need to use the “.” placeholder {r} df_tibble %>% .$carrier

Some older functions do not like tibbles. thus you might have to convert them back to dataframe. ```{r} class(df_tibble)

df_tibble_2 <- as.data.frame(df_tibble)

class(df_tibble_2)

df_tibble

head(df_tibble_2)


Tidyr
First we upload our library 
```{r}
library(tidyverse)

How do we make a tidy dataset? Well the tidyverse follows three rules.

#1 - Each variable must have its own column #2 - Each observation has its own row #3 - Each value has its own cell.

It is impossible to satisfy two of the three rules. This leads to the following instructions for tidy data

#1 put each dataset into a tribble #2 put each variable into a column #3 profit

Picking one consistent method of data storage makes for easier understanding of your code and what is happening “under the hood” or behind the scenes.

Lets now look at working with tibbles ```{r} bmi <- tibble(women)

bmi %>% mutate(bmi = (703 * weight)/(height)^2)



 Spreading and Gathering

Sometimes you'll find datasets that don't fit well into a tibble
We'll use the built-in data from tidyverse for this part
```{r}
table4a

As you can see from this data, we have one variable in column A (country) but columns b and c are two of the same. Thus, there are two observations in each row.

To fix this, we can use the gather function {r} table4a %>% gather('1999', '2000', key = 'year', value = 'cases') Lets look at another example {r} table4b As you can see we have the same problem in table 4b {r} table4b %>% gather('1999', '2000', key = 'year', value = "population")

Now what if we want to join these two tables? we can use dplyr ```{r} table4a <- table4a %>% gather(‘1999’, ‘2000’, key = ‘year’, value = ‘cases’) table4b <-table4b %>% gather(‘1999’, ‘2000’, key = ‘year’, value = “population”)

left_join(table4a, table4b)


 Spreading 

Spreading is the opposite if gathering. Lets look at table 2
```{r}
table2

You can see that we have redundant info in column 1 and 2 we can fix that by combining rows 1&2, 3&4 etc. {r} spread(table2, key = type, value = count)

This is the key of what we are turning into columns, the value is what becomes rows/observations In summary, spread makes long tables shorter and wider gather makes wide tables, narrower and longer.

Separating and pull

Now what happens when we have two observations stuck in one column? {r} table3 As you can see, the rate is just the populationand cases combined. We can use separate to fix this {r} table3 %>% separate(rate, into = c("cases", "population"))

However, if you notice, the column type is not correct. {r} table3 %>% separate(rate, into =c("cases", "populate"), convert = TRUE)

You can specify what you want to seperate based on. {r} table3 %>% separate(rate, into =c("cases", "populate"), sep = "/", convert = TRUE)

Lets make this look more tidy {r} table3 %>% separate( year, into = c("century", "year"), convert= TRUE, sep = 2 )

Unite

what happens if we want to do the inverse of separate? ```{r} table5

table5 %>% unite(date, century, year)

table5 %>% unite(date, century, year, sep = ““)


 Missing values

There can be two types of missing values. NA (explicit) or just no entry (implicit)

```{r}
gene_data <- tibble(
  gene = c('a', 'a', 'a', 'a', 'b', 'b', 'b'),
  nuc = c(20, 22, 24, 25, NA, 42, 67),
  run = c(1,2,3,4,2,3,4)
)
gene_data

The nucleotide count for Gene b run 2 is explicit missing. The nucleotide count for gene be run 1 in implicitly missing. One way we can make implicit missing values explicit is by putting runs in columns {r} gene_data %>% spread(gene, nuc)

If we want to remove the missing values, we can use spread and gather, and na.rm = TRUE {r} gene_data %>% spread(gene, nuc) %>% gather(gene, nuc, 'a' : 'b' , na.rm = TRUE) Another way that we can make implicit values explicit, is complete() {r} gene_data %>% complete(gene, run) Sometimes an NA is present to represent a value being carried forward ```{r} treatment <- tribble( ~ person, ~treatment, ~response, ################################################# “Isaac”, 1, 7, NA, 2, 10, NA, 3, 9, “VDB”, 1, 8, NA, 2, 11, NA, 3, 10, )

treatment


What we can do here is use the fill() option
```{r}
treatment %>%
  fill(person)

DPLYR

It is rare that you will be working with a single data table. The DPLYR package allows you to join two data tables based on common values.

Mutate joins - add new variables to one data frame from the matching observations in another Filtering joins - filters observation from one data frame based on whether or note they are present in another. Set operations - treats observations as they are set elements. {r} library(tidyverse) library(nycflights13)

Lets pull full carrier names based on letter codes {r} airlines Lets get info about airports {r} airports Lets get info about each plan {r} planes Lets get some info on the weather at the airports {r} weather Lets get info on singular flights {r} flights Lets look at how these tables connect

Flights -> planes based on tail number Flights -> airlines through carrier Flights -> airports origin and dest Flights -> weather via origin, year/month/day/hour

Keys

keys are unique identifiers per observation primary key uniquely identifies an observation in its own table.

One way to identify a primary key is as follows: {r} planes %>% count(tailnum) %>% filter(n>1) This indicates that the tailnumber is unique {r} planes %>% count(model) %>% filter(n>1)

Mutate join

```{r} flights2 <- flights %>% select(year:day, hour, origin, dest, tailnum, carrier)

flights2

flights2 %>% select(-origin, -dest) %>% left_join(airlines, by = ‘carrier’)


We've now added the airline name to our dataframe from the airline dataframe
 Other types of joins

Inner joins (inner_join) matches a pair of observations when their key is equal
outer joins (outer_join) keeps observations that appear in at least one table


Stringr

We must load our libraries
```{r}
library(tidyverse)
library(stringr)

You can create strings using single or double qoutes ```{r} string1 <- “this is a string” string2 <- ‘to put a “quote” in your string, use the opposite’

string1 string2


If you forget to close your string, you'll get this:
```{r}
string3 <- "where is this string going?"

string3

Just hit escape and try again multiple strings are stored in character vectors {r} string4 <- c("one", "two", "three") string4

Measuring string length ```{r} str_length(string3)

str_length(string4)


Lets combine two strings
```{r}
str_c("X", "Y")

str_c(string1, string2)

you can use sep to control how they are seerated ```{r} str_c(string1, string2, sep = ” “)

str_c(“x”, “y”, “z”, sep = “_“)


 Subsetting strings

you can subset a string using str_sub()
```{r}
HSP <- c("HSP123", "HSP234", "HSP456")

str_sub(HSP, 4,6)

This just drops the first four letters from the strings or you can use negatives to count back from the end {r} str_sub(HSP, -3, -1) You can convert the cases of strings like follows: ```{r} HSP str_to_lower(HSP)

str_to_upper()



 Regular Expression

```{r}
install.packages("htmlwidgets")

x <- c('ATTAGA', 'CGCCCCCGGAT', 'TATTA')

str_view(x, "G")

str_view(x, "TA")

The next step is, “.” where the “.” matches an entry {r} str_view(x, ".G.")

Anchors allow you to match at the start or the ending ```{r} str_view(x, “ATA”)

str_view(x, “TA$”)


Character classes/alternatives

\d matches any digit
\s matches any space
 [abc] matches a, b or c

```{r}
str_view(x, "TA[GT]")

[^anc] watches anything BUT a, b or c {r} str_view(x, "TA[AT]") You can also use | to pick between two alternatives {r} str_view(x, "TA[G|T]")

Detect Matches

str_detect() returns a logical vector the same length of input ```{r} y <- c(“apple”, “banana”, “pear”) y

str_detect(y, “e”)


How many common words contain letter e
words
```{r}
sum(str_detect(words, "e"))

Lets get more complex, what proportioon words end in a vowel? ```{r} mean(str_detect(words, “[aeiou]$”))

mean(str_detect(words, “A[aeiou]”))


Lets find all the words that don't contain "o" or "u"
```{r}
no_o <- !str_detect(words, "[ou]")

no_o

Now lets extract {R} words[!str_detect(words, "[ou]")]

You can also use str_count() to say how many matches there are in string ```{r} x

str_count(x, “[GC]”)


Lets couple this with mutate
```{R}
df <- tibble(
  word = words,
  count = seq_along(word)
)

df

df %>%
  mutate(
    vowels = str_count(words, "[aeiou]"),
    constonants = str_count(words, "[^aeiou]")
  )

MICROARRAY

Limma Microarray Analysis ```{r} setwd(“~/Desktop/classroom/myfiles/bioinformatics in R”) #setwd(“~/Desktop/classroom/myfiles”)

library(ath1121501cdf) library(ath1121501.db) library(annotate) library(affy) library(limma) library(oligo) library(dplyr) library(stats) library(reshape)


 Read the cell files into the directory
```{r}
targets <- readTargets("Bric16_Targets.csv", sep = ",", row.names = "filename")

ab <- ReadAffy()

hist(ab)

Normalizing the microarray experiments ```{r} eset <- affy::rma(ab)

pData(eset)

Lets visualize the normalized data
```{R}
hist(eset)

Lets characterize the data a bit ```{r} ID <- featureNames(eset) Symbol <- getSYMBOL(ID, “ath1121501.db”)

affydata <- read.delim(“AFFY_ATH1_array_elements.txt”)



Differential Gene Expression Analysis

Flight vs Ground
```{r}
condition <- targets$gravity

design <- model.matrix(~factor(condition))
colnames(design) <- c("flight", "ground")

fit <- lmFit(eset, design)
fit <- eBayes(fit)
options(digits = 2)
top <- topTable(fit, coef =2, n=Inf, adjust='fdr')

Lets combine annotations {r} Annot <- data.frame(GENE = sapply(contents(ath1121501GENENAME), paste, collapse = ", "), KEGG = sapply(contents(ath1121501PATH), paste, collapse = ", "), GO = sapply(contents(ath1121501GO), paste, collapse = ", "), SYMBOL = sapply(contents(ath1121501SYMBOL), paste, collapse = ", "), ACCNUM = sapply(contents(ath1121501ACCNUM), paste, collapse = ", "))

Lets merge all the data into one dataframe ```{r} all <- merge(Annot, top, by.x = 0, by.y = 0, all = TRUE)

all2 <- merge(all, affydata, by.x = “Row.names”, by.y = “array_element_name”) ```

Lets convert the ACCNUM to a character value ```{r} all2$ACCNUM <- as.character(all2$ACCNUM)

write.table(all2, file=“BRIC_16_Final.csv”, row.names = TRUE, col.names = TRUE, sep = “)

columns(org.At.tair.db)

foldchanges <- as.data.frame(all2$logFC)

all2$entrez = mapIds(org.At.tair.db, keys = all2$ACCNUM, column = “ENTREZID”, keytype = “TAIR”, multiVals = “first”)

head(all2, 10)


Pathview

```{r}
library(pathview)
library(gage)
library(gageData)
data(kegg.sets.hs)

foldchanges = all2$logFC
names(foldchanges) = all2$entrez

head(foldchanges)

kegg.ath = kegg.gsets("ath", id.type = "entrez")
kegg.ath.sigmet = kegg.ath$kg.sets[kegg.ath$sigment.idx]

lapply(keggres, head)

Lets get the results ```{r} keggres = gage(foldchanges, gsets = kegg.ath.sigmet, same.dir = TRUE)

lapply (keggres, head)

greater <- keggres$greater lessers <- kegres$less

write.csv(greater, file = “BRIC_16_pathview_Greater_Pathways.csv”) write.csv(lessers, file = “BRIC_16_pathview_Lesser_Pathways.csv”)


Lets map to pathways (greaters)
```{r}
keggrespathways = data.frame(id=rownames(keggres$greater), keggres$greater) %>%
  tbl_df() %>%
  filter(row_number() <=5) %>%
  .$id %>%
  as.character()
keggrespathways

keggresids_greater = substr(keggrespathways, start=1, stop=8)
keggresids_greater

genedata <- as.character(all2logFC)

Define a plotting function to apply next {r} plot_pathway = function(pid) pathview(gene.data=foldchanges, pathway. id = pid, species = "ath", new.signature = FALSE) Plot multiple pathways (plots are saved to disk and returns a throwaway object list) {r} tmp = sapply(keggresids_greater, function(pid) pathview(gene.data=foldchanges, pathway.id = pid, species = "ath"))

Lets math pathways lessers ```{r} keggrespathways = data.frame(id=rownames(keggres$less), keggres$less) %>% tbl_df() %>% filter(row_number() <=5) %>% .$id %>% as.character() keggrespathways

keggresids_less = substr(keggrespathways, start=1, stop=8) keggresids_less

genedata <- as.character(all2logFC)


```{r}
 Define a plotting function to apply next

plot_pathway = function(pid) pathview(gene.data=foldchanges, pathway. id = pid, species = "ath", new.signature = FALSE)

```{r} plot multiple pathways (plots are saved to disk and returns a throwaway object list)

tmp = sapply(keggresids_less, function(pid) pathview(gene.data=foldchanges, pathway.id = pid, species = “ath”))





RNA SEQUENCE 

---
title: "DESeq"
author: "LaChanti Phillips"
date: "2025-11-15"
output: html_document
---

```{r}
library("edgeR")
library("dplyr")
library("AnnotationDbi")
library("org.Mm.eg.db")

```{r} setwd(“~/Desktop/classroom/myfiles/bioinformatics in R”)

library(readxl)

rawdata = read.csv (“GLDS-102_rna_seq_Normalized_Counts.csv”, row.names = “gene_id”)

group <- factor(c(‘1’, ‘1’, ‘1’, ‘1’, ‘1’, ‘1’, ‘2’, ‘2’, ‘2’, ‘2’, ‘2’, ‘2’))

dgeGlm <- DGEList(counts = rawdata, group = group) str(dgeGlm)

keep <- rowSums(cpm(dgeGlm)>2 >=4)

dgeGlm <- dgeGlm[keep,]

names(dgeGlm)

dgeGlm[[“samples”]]

design <- model.matrix(~group) design

dgeGlmComDisp <- estimateedGLMCommonDisp(dgeGlm, design, verbose = TRUE)

dgeGlmTrendDisp <- estimateGLMTrendedDisp(dgeGlmComDisp, design)

dgeGlmTagDisp <- estimateGLMTagwiseDisp(dgeGlmTrendDisp, design)

plotBCV(dgeGlmTagDisp)

fit <- glmFit(dgeGlmTagDisp, design)

colnames(coef(fit))

lrt <- glmLRT(fit, coef =2 )



```{r}
columns(org.Mm.eg.db)

ttGlm2 <- as.data.frame(ttGlm$table)

ttGlm2$symbol = mapIds(org.Mm.eg.db, keys)

PROTEIN ALIGNMENT

we must first download the library

```{r} library(msa)

setwd(“~/Desktop/classroom/myfiles/bioinformatics in R”) seq <- readAAStringSet(“hglobin.fa”)

seq



Lets align the 8 different amino acid sequences
```{r}
alignments <- msa(seq, method = "ClustalW")
alignments

msaPrettyPrint(alignments, output = "pdf", showNames = "left", 
               showLogo = "none", askForOverwrite = FALSE,
               verbose = TRUE, file = "whole_aligh.pdf")

msaPrettyPrint(alignments, c(10,30), output = "pdf", showNames = "left", 
               file = "Zoomed_align.pdf", showLogo = "top", askForOverwrite = FALSE,               verbose = TRUE)

Lets make a tree from our alignment Firt download the library {r} library(ape) library(seqinr)

Convert to seqinr alignment -> get the distances and make a tree ```{r} alignment_seqinr <- msaConvert(alignments, type = “seqinr::alignment”)

distances1 <- seqinr::dist.alignment(alignment_seqinr, “identity”)

tree <- ape::nj(distances1)

plot(tree, main = “Phylogenetic Tree of HBA Sequences”)



SYNTENY
 you must first upload your library 
```{r}
library(DECIPHER)

In the first step, we load the libraries and the sequence into long_seqs This is a DNAStringSet object ```{r} setwd(“~/Desktop/classroom/myfiles/bioinformatics in R”)

long_seq <- readDNAStringSet(file.path(getwd(), “datasets”, “ch3”, “plastid_genomes.fa”)) long_seq

Now lets build a temporary SQLite database
```{r}
Seqs2DB(long_seq, "XStringSet", "long_db", names(long_seq))

Now that we have built the database, we can do the following: Find the synthetic blocks {r} synteny <- FindSynteny("long_db")

view blocs with plotting {r} pairs(synteny) plot(synteny)

Make an actual alignment file {r} alignment <- AlignSynteny(synteny, "long_db")

Lets create a structure holding all aligned sytenic blocks for a pair of sequences {r} block <- unlist(alignment[[1]])

We can write to file one alignment at a time {r} writeXStringSet(block, "genome_blocks_out.fa")

UNANNOTATED

We must first download the libraries {r} library(locfit) library(Rsamtools)

Lets create a function that will load the gene region information in a GFF file and convert it to a bioconductor GRanges object {r} get_annotated_regions_from_gff <- function(file_name) {gff <- rtracklayer::inport.gff(file_name) as(gff, "GRanges")}

Get count in windows across the genome in 500bp segments ```{r} whole_genome <- csaw::windowCounts(

file.path setwd(“~/Desktop/classroom/myfiles/bioinformatics in R”) file.path(getwd() “datasets”, “ch1”, “windows.bam”), bin = TRUE, filter = 0, width = 500, param = csaw::readParam( minq = 20, determines what is a passing read deduq = TRUE, removes pcr duplicate pe = “both” requires that both pairs of reads are aligned))


Since this is a single column of data, let's rename it
```{r}
colnames(whole_genome) <-("small_data")

annotated_regions <- get_annotated_regions_from_gff(file.path(getwd(), "datasets", "ch1", "genes.gff"))

Now that we have the windows of high expression, we want to see if any of them overlap with annotated regions. We must download the libraries

{r} library(IRanges) library(SummarizedExperiment) Find the overlaps between the windows we made ```{r} windows_in_genes <- IRanges::overlapsAny( SummarizedExperiment::rowRanges(whole_genome), creates a Granges object annotated_regions)

windows_in_genes


 Here we subset the whole_genome object onto annotated and nonannotated regions
```{r}
annotated_window_counts <- whole_genome[windows_in_genes,]
non_annotated_window_counts <-whole_genome[!windows_in_genes,]

Use assay () to extract the actual counts from the Granges object assay(non_annotated_window_counts) In this step, we use Rsamtools Puleup() function to get a per-base coverage dataframe. Each row represents a single nucloetide in the reference count and the count column Gives the depth of coverage at that point ```{r} library(bumphunter)

pile_df <- Rsamtools::pileup(file.path(getwd(), “datasets”, “ch1”, “windows.bam”))


This step groups the reads with certain distance of each other into a cluster. we give the sequence names, position and distance.

```{r}
clusters <- bumphunter::clusterMaker(pile_df$seqnames, pile_df$pos, maxGap = 100)

table(clusters)

In this step, we will map the reads to the regions we found for the genome

{r} bumphunter::regionFinder(pile_df$count, pile_df$seqnames, pile_df$pos, clusters, cutoff = 1)

PHYLOGENIC ANALYSIS

Treeio

Lets load the required packages

{r} library(ggplot2) library(ggtree) library(treeio)

First we need to load our raw tree data. Its a Newick format so we use:

{r} itol <- ape::read.tree("itol.nwk")

Now we will print out a very basic phylogenetic tree

{r} ggtree(itol)

we can also change the format to make it a circular tree {r} ggtree(itol, layout = "circular")

we can also change the left-right/ up-down direction {r} ggtree(itol) + coord_flip() + scale_x_reverse()

By using geom_tipla() we can add names to the end of tips {r} ggtree(itol) + geom_tiplab(color = "indianred", size = 1.5)

By adding a geom_strip() layer we can annotate clades in the tree with a block of color {r} ggtree(itol, layout = "unrooted") + geom_strip(13,14, color = "red", barsize = 1)

we can highlight clades in unrooted trees with blobs of color using geom_hilight ```{r} ggtree(itol, layout = “unrooted”) + geom_hilight(node = 11, type = “encircle”, fill = “steelblue”)

install.packages(‘BAMMtools’) library(BAMMtools)



We can use the MRCA (most recent common ancestor) function to find the node we want

```(r)
getmrca(itol, "Photorhabdus_luminescens", "Blochmannia_floridanus")

Now if we want to highlight the section of the most recent common ancester between the two above {r} ggtree(itol, layout = "unrooted") + geom_hilight(node = 206, type = "encircle", fill = "steelblue")

R Final Project

LaChanti Phillips

11-12-2025

`11-12-2025`