0.1 R packages used

A variety of R packages were used for this analysis. All graphics and data wrangling were handled using the tidyverse suite of packages. All packages used are available from the Comprehensive R Archive Network (CRAN), Bioconductor.org, or GitHub.
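For reference, here is a minimal sketch of how packages from each of these three sources are typically installed; the package names are examples that appear later in these notes, and the GitHub repo is a placeholder:

install.packages("tidyverse")                         # CRAN
install.packages("BiocManager")                       # CRAN helper for Bioconductor
BiocManager::install(c("tximport", "edgeR", "limma")) # Bioconductor
install.packages("remotes")                           # CRAN helper for GitHub installs
# remotes::install_github("user/package")             # GitHub (placeholder repo)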


0.2 Course Overview

Goals:

  • Learn to analyze your own bulk and single cell RNA-seq data
  • Develop a lightweight and reusable RNA-seq pipeline
  • Learn best practices for working in R/Bioconductor
  • Become a better data scientist
  • Learn basic building blocks to create publication-quality graphics
  • Learn to report your work in a transparent and reproducible way

Course Path:

  1. Data import
  2. Data wrangling
  3. Data exploration
  4. Accessing public data
  5. Feature selection
  6. Finding modules
  7. Functional enrichment analysis
  8. Dynamic reports

0.3 Intro to RNAseq technology and data

0.3.1 Part 1: Illumina’s SBS

  • Shankar Balasubramanian and David Klenerman were scientists interested in using fluorescently tagged nucleotides to monitor DNA synthesis. This work led to Illumina sequencing (‘sequencing by synthesis’ technology).

  • There were many different sequencing techniques around 2012; however, many have since become obsolete.

  • The MiniSeq, MiSeq, NextSeq, and NovaSeq all perform RNA sequencing; however, each is capable of different run times and numbers of reads. This wide array of sequencing technology options has helped democratize sequencing across organizations and institutions.

  • There has been a shift in the time and money required for sequencing. Sequencing itself no longer takes much time, so you are able to design much larger and more complex experiments.

  • The major technological hurdle for Illumina sequencing was the inability, at the time, to sequence single molecules directly.

  • ‘Sequencing by Synthesis’ (SBS) is the process by which a single molecule is captured, folded over to hybridize to a primer, and used to synthesize a second, reverse strand.

  • PCR colony (‘polony’) size and distance can be controlled easily with SBS.

  • Clustering density is how close or far colonies are from each other.

    • Overclustering is when colonies are too close together and there are too many to resolve; it is hard to extract clean data from a crowded slide.

    • Underclustering is when there are too few colonies for collection. The data are clean, but very sparse.

  • How the sequencing happens (Illumina Sequencing Source): Once a polony is of an appropriate size, the data collection begins.

    1. The DNA is broken up into small fragments and adaptors (short DNA sequences) attach to the fragments.
    2. Through incubating the fragments with sodium hydroxide, the double-stranded DNA fragments become single stranded.
    3. The DNA fragments are then washed across the flow cell; complementary DNA molecules stick to the flow cell surface, and the rest is washed away.
    4. The flow cell DNA is replicated to form clusters of DNA with the same sequence. Each cluster emits a signal that can be picked up by a camera when sequenced.
    5. Extra nucleotide bases are added, which creates double-stranded bridges of DNA between the primers and the flow cell surface.
    6. Using heat, the DNA is made single-stranded resulting in multiple clusters of identical single-stranded DNA sequences.
    7. Fluorescently tagged nucleotides are added to the flowcell and they compete for addition to the growing chains. Primers are also added at this time.
    8. The primer binds to the DNA, and DNA polymerase adds one fluorescently tagged nucleotide to the growing strand.
    9. Lasers activate the fluorescence of the flow cell, and each nucleotide base emits a different color.
    10. The fluorescent nucleotide is removed and another is added. Repeat.

0.3.2 Part 2: Study Design

  1. You should always have an independent evaluation that your experiment worked. We do not want the RNAseq data itself to be proof of experimental success. We need to have validation that our experiment is doing what we think it should be doing.
  2. We need to have controls for our study.
  3. What insight is to be gained from each treatment or condition? We want to know what our “prize” analysis is, but also some secondary comparisons that maybe are not as interesting. These secondary and tertiary conditions help show that the experiment worked.
  4. What signal and noise are we expecting (signal-to-noise ratio)? There could be contamination in the cells (noise) that could impact our ability to detect true changes (signal).
  5. How many replicates do we need vs. how deep should we sequence? Comparing the two, more replicates are usually cheaper and more informative than deeper sequencing, and shallow sequencing can still reveal important biological patterns. With more than 12 replicates you gain good power to detect truly expressed genes, though ‘standard’ designs use only 2 to 3 replicates. Differences in results across read depths are extremely small, but deeper sequencing does help for alternative splicing.
  6. What is the cost? Library preparation and sequencing are the two main factors.
  • We are measuring gene expression in nucleated red blood cell progenitors in the presence of poly I:C (a double stranded RNA that mimics viral infection). Poly I:C (PIC) is not able to travel through the membrane on its own, so we use Lipofectamine (LF) (a lipid carrier that fuses with the plasma membrane) to carry it through to inside the cell to trigger an immune signaling pathway.
  • We have three replicates for when poly I:C and LF were present, and when just LF was present (vehicle control).
  • We conducted paired-end sequencing, meaning each fragment was read from both ends (R1 and R2, i.e., 5’ to 3’ and the complement 3’ to 5’) to increase our confidence in the reads.

0.4 Setting up your software environment

0.4.1 Part 1: Software tools

  • Compute resources:
    1. Lab workstation - unrestricted access, high costs
    2. The cloud - scalable, easy set-up, requires data transfer, less secure, potential charges
    3. Compute cluster - scalable, highly cost-efficient, secure
  • R/Bioconductor = free & open-source; no need to write code de novo; huge user community; reproducible results; modular & standardized; not only for transcriptomics; publication-quality graphics; no single ‘right way’ to do anything; lacking in terms of user interface

0.4.2 Part 2: Installing software

  • Download most current version of R
  • RStudio provides us with an Integrated Development Environment (IDE); it has a text editor, workspace, library/plots/file-browser/help panes, and a terminal built in
  • R packages have multiple uses with a focus on stats, graphics, and modeling
  • Always install any updates it offers
  • We will install Kallisto and other tools using Conda
  • Conda is an open-source environment management system that is critical for management of dependencies
  • We need to configure our Conda channels as well
  • You need to create your first Conda environment to install Kallisto
  • Conda environments are siloed
  • Downloaded Bioconductor, Conda, and all related dependencies

0.5 Ultra-fast Read Mapping with Kallisto

  • Read mapping is the process of aligning the reads with a reference genome

0.5.1 Part 1: FastQC & FastP

  • Try not to pollute the base environment - install packages into separate environments
  • Activate the environment with conda activate, then call the program you want to use
  • Do not unzip/decompress the files (decompressed fastq files can take more space than your local computer can hold, and the tools read the gzipped files directly)
  • Read quality drops as the number of cycles increases
  • Use fastqc to perform a quality assessment on our data (fastqc *.gz -t 8) or fastp (fastp -i sample1.gz -t 8)
  • HTML fastqc file can be opened in a web browser to view all of the fastqc data
  • Fastqc will give you plots and statistics; individual modules may be flagged as failing
  • Duplicate sequences are expected with RNAseq (highly expressed transcripts yield many identical reads), so a duplication warning is not a concern
  • You do not have to run fastp (I had some issues with it)
  • qc = quality control of our experimental mRNA data

0.5.2 Part 2: Kallisto & MultiQC

  • Downloaded DNA fasta file from [https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/dna/] (Homo_sapiens.GRCh38.cdna.all.fa.gz)
  • Drop fasta file in sublime to view the data
  • It contains one entry per transcript: the transcript ID, gene ID, gene symbol, and the cDNA sequence itself
  • We need to build an index (you only need to do it once)
  • Be careful with file names! (& do not use spaces)
  • Ran:
kallisto index -i Homo_sapiens.GRCh38.cdna.all.index Homo_sapiens.GRCh38.cdna.all.fa
  • Found that it has 1037056 contigs (contiguous sequences) and contains 116726086 k-mers
  • Creates a large index file as a result
  • Mapping single-end data with:
kallisto quant \
-i Homo_sapiens.GRCh38.cdna.all.index \
-o test \ 
-t 8 \
--single -l 250 -s 30 \
fastq file name \ 
&> test.log
  • Multi-QC demonstration: generates output and HTML files
  • Multiqc summarizes outputs
multiqc -d .
  • Mapping paired-end data with (note that the --single, -l, and -s flags are dropped, since fragment length is inferred from the read pairs):
kallisto quant \
-i Homo_sapiens.GRCh38.cdna.all.index \
-o test \
-t 8 \
read 1 fastq file name \
read 2 fastq file name \
&> test.log

0.5.3 Part 3: Methods for Quantifying Gene Expression

  • Kallisto performs pseudoalignment: it matches reads to a reference transcriptome rather than aligning them base by base to the genome
  • We want to automate many fastq files (index construction & mapping)
  • List all of the commands in a .sh file (shell script)
  • Rather than typing each command at the command line, list them all in the script and run it once
  • Download readMapping.sh file and drop it on Sublime; it gives you a pipeline
  • Read mapping is text pattern matching
  • There is a tradeoff when building an index: memory and computational complexity (e.g., STAR builds a large genome index)
  • The Burrows-Wheeler Transform (BWT) takes all rotations of a text, sorts the rows lexicographically, and keeps the last column
  • A suffix array lets you search lexicographically (align) within the sorted structure
  • Aligners take a seed sequence and extend matches (less computationally intensive than matching a single letter at a time)
  • Rsubread aligner = seeds ‘vote’ on where the alignment most likely is; makes the alignment process much less computationally intensive
  • Sailfish = exact matching of short k-mers; skips the approximate matching-and-extension step; shreds the reads into k-mers; involves few parameters
  • The idea of alignment was abandoned entirely for Kallisto:
  1. Create a transcriptome de Bruijn graph (T-DBG) from k-mers (k = 31 nucleotides)
  2. Walk each read's k-mers through the graph and ask which transcripts they are compatible with
  3. Splits in the graph show you which transcripts are compatible with the read
  4. Instead of aligning, we find the set of transcripts that are compatible with the read
  5. You don't even need to look up every possible k-mer
  6. You can jump between skip nodes and non-skip nodes
  7. For most reads, only two k-mers need to be hashed, because all k-mers from the entire read map to a contiguous region in the T-DBG
  8. It is really fast; kallisto can also generate bootstraps with -b, which can help you model technical variance in the dataset (we are not going to run bootstraps)
  • Used the following code to run readMapping.sh (on all of the fastq files):
chmod 777 readMapping.sh
./readMapping.sh
  • It again reported 1037056 contigs and 116726086 k-mers
  • Getting the following error for HTS0101, HTS0302, HTS0701, HTS0801, and HTS0901.
Failed to process file HTS048_09_3C_LF_S9_L001_R1_001.fastq
uk.ac.babraham.FastQC.Sequence.SequenceFormatException: Ran out of data in the middle of a fastq entry.  Your file is probably truncated
        at uk.ac.babraham.FastQC.Sequence.FastQFile.readNext(FastQFile.java:187)
        at uk.ac.babraham.FastQC.Sequence.FastQFile.next(FastQFile.java:129)
        at uk.ac.babraham.FastQC.Analysis.AnalysisRunner.run(AnalysisRunner.java:77)
        at java.base/java.lang.Thread.run(Thread.java:1623)
  • Originally ran with single-end read mapping, but should have used paired-end read mapping, which uses both R1 (5’ to 3’) and R2 (3’ to 5’), for completeness
    • With single-end read mapping, reads get sloppy at the beginning and end (paired-end avoids this)
    • This is why quality control failed at per base sequence content
  • To rerun with paired-end read mapping, I edited readMapping.sh, ran conda activate, navigated to raw Data folder, and ran ./readMapping.sh.

0.6 Understanding RNAseq Count Data

0.6.1 Part 1: Preamble

  • Downloaded step 1 script
  • We have used command line tools to assess the quality of the raw reads (quality control)
  • Used Kallisto to take a fasta file and turn it into an index (constructing an index only needs to be done once, and it makes repeated searching easier and faster)
  • Used Kallisto to map our raw reads against the transcriptome to get summarized read counts
  • I do not have a tsv file from Kallisto - going to rerun to see if I get one

0.6.2 Part 2: Measuring Digital Gene Expression

  • Effective length of a transcript = actual transcript length - fragment length + 1 (the number of possible positions a fragment of that size could originate from within the transcript)
  • Transcript is from the transcriptome & fragment is one of our reads
  • RNAseq gives a relative measure of gene expression
  • Techniques used to measure mRNA abundance and reads for RNAseq data are used to estimate a statistic that is as closely proportional to the relative molar concentration as possible
  • Understanding the units for measurement for RNAseq is critical to understanding how we determine differential expression
  • Normalization = any of a number of ways that a dataset is globally altered to improve our ability to detect differentially expressed genes
  • What happens to our data when we tweak it in different ways?
  • On the raw count scale, the most lowly expressed genes have low SDs and the most highly expressed genes have high SDs
  • This is a problem because unequal variance across the expression range (heteroskedasticity) makes it harder to pull real changes out of the data
  • A log2 conversion of the raw counts flips this: the most lowly expressed genes now have the highest SDs & vice versa
  • With large datasets, fixing one problem globally creates another (check your assumptions and fit them to your data)
  • It is still better to represent RNAseq data on a log2 scale
  • Scaling units for between-sample comparisons: ideally, the expected reads per sample would be even across samples, but the world is not this way (libraries are unbalanced); this scaling adjusts for the depth of sequencing of each sample
  • Scaling units for within-sample comparisons: different features (genes/transcripts) are different sizes; this scaling adjusts for the length of each feature
  • Reads per kilobase, per million reads (RPKM) = between-sample + within-sample scaling; a classic unit of expression for RNAseq data; RPKM = (# of reads mapped to a genomic region) / [(region length in kb) * (total # of reads)] * 10^6
  • RPKM is no longer the accepted unit; it is not sufficient for normalization
  • The problem with RPKM is that the average RPKM is not guaranteed to be the same value for every sample mapped to the same genome - averages can be very, very different between samples, so values are not directly comparable
  • Fixing RPKM is easy - scale by gene length first: transcripts per million (TPM) = (reads per kilobase, RPK) / (total RPK in sample) * 10^6; now the TPMs in every sample total 1 million (see the sketch at the end of this list)
  • Further normalization steps need to be taken before one can compare between samples
  • Library size scaling is too simple for many biological applications
  • Scaling will all be messed up because of the number of original genes matching to each one
  • Most current normalization methods attempt to find a set of genes with minimal variance across samples for normalization
  • Statistical tools make inherent assumptions about your data
  • Most genes are not differentially expressed in your dataset (implications for comparing very different treatments/conditions)
  • They also assume approximately equivalent numbers of up- and down-regulated genes
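To make these units concrete, here is a minimal sketch in R with a toy counts matrix and hypothetical gene lengths (all values invented for illustration):

# toy inputs: counts for 3 genes x 2 samples, and gene lengths (hypothetical)
counts <- matrix(c(100, 200, 300, 110, 190, 330), nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2")))
lengths_kb <- c(2, 1, 4) # gene lengths in kilobases

# RPKM: scale by library size (per million) first, then by gene length
rpkm <- t(t(counts) / colSums(counts)) * 1e6 / lengths_kb

# TPM: scale by gene length first (reads per kilobase), then by the per-sample RPK total
rpk <- counts / lengths_kb
tpm <- t(t(rpk) / colSums(rpk)) * 1e6

colSums(tpm)  # exactly 1e6 in every sample; colSums(rpkm) varies between samples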

0.6.3 Part 3: Starting Our R Project and Step 1 Script

  • When you start a new R project, RStudio starts a fresh R session, sources the .Rprofile in the project’s main directory, and gives you access to previous commands via the .Rhistory file
  • An R script != an RStudio project (one project can have many R scripts associated with it)
  • Edit -> Folding -> Collapse All breaks the script down into foldable code chunks (make a chunk by adding “----” after a comment title)
  • Edit -> Folding -> Expand All to undo it
  • RStudio is text-aware
  • Click install on right-hand pane under “Packages”
  • Run lines of code by highlighting them
setRepositories() # choose which repositories install.packages() can draw from
# entering "1 2 3 4 5 6 7 8" at the prompt selects all of them (CRAN plus the Bioconductor repositories and others)

0.7 Starting Your R Workflow

0.7.1 Part 1: Starting Step 1 Script

  • The tidyverse is an ecosystem of R packages
  • ?read_tsv gives you manual page (help documentation)
  • If you highlight an object’s name and press Control+Enter, it will print the object’s contents
  • setRepositories() before installing R packages
  • Automate file paths
  • Downloaded studydesign.txt from data section and updated it with appropriate folder names
  • My all(file.exists(path)) only produces one TRUE - this is expected: all() collapses the logical vector from file.exists() into a single value
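For reference, a minimal sketch of this part of the Step 1 script, assuming studydesign.txt has a column named sample and that each sample's Kallisto output lives in a folder of the same name (the column name and folder layout are assumptions):

library(tidyverse)
# read the tab-separated study design file
targets <- read_tsv("studydesign.txt")
# build one file path per sample, pointing at each Kallisto output folder
path <- file.path(targets$sample, "abundance.tsv")
# returns a single TRUE only if every file is found
all(file.exists(path))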

0.7.2 Part 2: Tapping Into Annotation Databases and Reading Kallisto Data into R

  • Want to map transcription level expression data to gene level expression data (need to have one-to-one mapping of transcript ids to gene ids) - this is because a transcript is not helpful unless we know which gene it is from
  • BiomaRt allows users to connect to a database of many different annotations
  • You can pull out all of the gene level annotations for a dog (ex.)
  • A tibble is a tidied-up data frame
  • With non-model organisms, there are often fewer gene symbols than transcript identifiers available
  • Setting txOut to TRUE in tximport returns transcript-level output instead of summarizing transcripts to the gene level
  • We used the tximport package to read our Kallisto outputs into the R environment
  • We used ensembldb to attach gene-level information to our transcript data (maps our data to the gene level)
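A minimal sketch of the import step, assuming path from the study design step and a two-column data frame Tx mapping transcript IDs to gene symbols; countsFromAbundance and ignoreTxVersion here are assumptions about sensible defaults, not a record of the exact call used:

library(tximport)
Txi_gene <- tximport(path,
                     type = "kallisto",               # our outputs came from Kallisto
                     tx2gene = Tx,                    # transcript-to-gene mapping
                     txOut = FALSE,                   # summarize transcripts to gene level
                     countsFromAbundance = "lengthScaledTPM",
                     ignoreTxVersion = TRUE)          # drop version suffixes on transcript IDs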

0.8 Wrangling Gene Expression Data

0.8.1 Part 1: Starting Step 2 Script

  • The Step 2 Script will be used to filter our data and get normalized read counts
  • R Data Types:
  • vector = series of objects
  • matrix = has rows & columns; will have the same kind of data
  • array = several matrices stacked together
  • dataframe = matrix but can have mix of types
  • list = any combination of the above data types
  • Our transcript list will be far bigger than our gene list
  • The # of elements is the number of data points in the matrix
  • There are a lot of transcripts with no evidence of gene expression data
  • “Counts” are the actual number of reads matching to that gene in that sample (it is an element in our list)
  • “Abundance” refers to the normalized abundance estimate (TPM for Kallisto), not a raw read count
  • The sums of the columns in myTPM should be close to 1 million, since TPM is per million - and remember that transcripts per million refers to transcripts, not genes
colSums(myTPM)
[1] 912920.2 910074.9 911529.2 909853.9 909855.2 909729.7
  • The sums of the columns in the gene-level data (myTPM_Counts) do not total 1 million - these are counts, not TPM
colSums(myTPM_Counts)
[1] 48423530 59355199 43597530 52757379 53130928 52197283
  • Use $ to get a column of a data frame

0.8.2 Part 2: Walking Through the Step 2 Script, & Relating Our Work to the ‘Grammar of Graphics’ and ‘Tidy’ Data

  • It is extremely easy to manipulate a data matrix (calculate standard deviation, average, and median)
  • ggplot() creates a graph using layers (you can stack these - styles) - give it the data you want it to graph and give it the aesthetics (the ‘grammar of graphics’)
myTPM.stats <- transform(myTPM, 
                         SD=rowSds(myTPM), 
                         AVG=rowMeans(myTPM),
                         MED=rowMedians(myTPM))
                         
ggplot(myTPM.stats) + 
  aes(x = SD, y = MED) +
  geom_point(shape=25, size=3)
  • There is heteroskedasticity across our data (unequal variance across median values) - representing our data on a log2 scale helps with this (an artifact of count data)
  • ggplot2 shape parameter = menu for type of shape for data points

Summary Stats (Median v. SD) For Each Transcript or Gene:

  • Use geom_hex instead of geom_point to represent the data as hexagonal bins (a density heatmap showing how many genes fall in each region)
ggplot(myTPM.stats) + 
  aes(x = SD, y = MED) +
  geom_hex() # geom_hex bins points; the shape/size arguments of geom_point do not apply
  • You can specialize themes of your ggplot with various theme packages
  • ggplot are plots on plots (layered)
  • Benefits of ggplots:
  • Users can iteratively update a plot, changing specific elements as needed
  • Even high-level aspects of a plot can be changed
  • Layers allow plots to be overlaid
  • Layers also make it easy to use # to toggle layers on/off
  • Default arguments for each layer minimize the need for users to dictate details
  • We make a DGEList (an edgeR data structure) from our counts - it has two elements: 1. the counts matrix with sample labels, 2. a samples data frame with group, library size, and normalization factors (good to know how large our libraries are)
  • We use counts per million (CPM) instead of TPM here because it reflects how many times we hit a particular gene, rather than rewarding lots of short fragments that don’t really align (noise)
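A minimal sketch of this step, assuming Txi_gene is the tximport output from Step 1:

library(edgeR)
myDGEList <- DGEList(counts = Txi_gene$counts)  # build the DGEList from the counts matrix
cpm <- cpm(myDGEList)                           # counts per million
log2.cpm <- cpm(myDGEList, log = TRUE)          # log2 CPM (a small prior count avoids log(0))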

Counts Per Million on Log2 Scale:

  • We will coerce our data matrix (which respects row names) to a data frame; since the tidy workflow does not rely on row names, we need to save the gene IDs as a geneID column
  • When we wrangle our data, we will often lose our sample labels, so we will need to add them back in
  • We will now work to filter our data to reduce any noise
  • Tidy data philosophy = “A dataset is messy or tidy (long) depending on how rows, columns and tables are matched up with observations, variables, and types.”
  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.
  • Gene expression data is always messy (wide) - we need to take every variable and make it form a column
  • Tidy data is ideal for graphing - values are easily retrieved as a single vector
  • We can plot the statistical summary of our data - plot every sample on the x-axis and the log2 on the y-axis
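A minimal sketch of the pivot and the summary plot, assuming a wide log2 CPM tibble named log2.cpm.df with a geneID column (the object name is an assumption):

# pivot from wide (one column per sample) to tidy/long form
log2.cpm.df.pivot <- pivot_longer(log2.cpm.df,
                                  cols = -geneID,           # pivot everything except geneID
                                  names_to = "samples",     # old column names become a 'samples' column
                                  values_to = "expression") # values go into an 'expression' column

# tidy form makes the summary plot easy: samples on x, log2 expression on y
ggplot(log2.cpm.df.pivot) +
  aes(x = samples, y = expression, fill = samples) +
  geom_violin(trim = FALSE, show.legend = FALSE)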

Filtered Data:

  • The plot shows the distribution of the data for each sample, with the data binned by sample
  • It would be unusual if there was a lot of variation among samples
  • You can flip the coordinates on the sample by adding coord_flip()
  • We are trying to make a logical matrix of all of the genes
  • 12229 / 35371 genes have no counts at all (great!) - you can summarize this with the table() command
  • We now set the cutoff for the number of samples in the smallest group of comparison & filter it
  • We are using subsetting to tell it which rows or columns we want to remove or keep
  • The following result means we kept 11752 genes - those with more than 1 count per million (CPM) in at least 3 samples (3 being the size of our smallest group of comparison)
keepers <- rowSums(cpm>1)>=3 # TRUE for genes with cpm > 1 in at least 3 samples
myDGEList.filtered <- myDGEList[keepers,] # keep only those rows
dim(myDGEList.filtered)
[1] 11752     6
  • We now normalize our data by applying the “TMM” (trimmed mean of M-values) method, so samples have consistent units
  • The normalization factors are no longer all 1; they adjust each library up or down to control for differences in count distributions
  • Our normalized data does not look very different from the non-normalized data, which suggests we have lots of data and reads (see the sketch below)
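A minimal sketch of the normalization call that produces the object used in the code below:

# compute TMM normalization factors (edgeR); stored in myDGEList.filtered.norm$samples
myDGEList.filtered.norm <- calcNormFactors(myDGEList.filtered, method = "TMM")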
log2.cpm.filtered.norm <- cpm(myDGEList.filtered.norm, log=TRUE)
log2.cpm.filtered.norm.df <- as_tibble(log2.cpm.filtered.norm, rownames = "geneID")
colnames(log2.cpm.filtered.norm.df) <- c("geneID", sampleLabels)

Normalized Data:

  • Today, we plotted the raw, filtered, and normalized data and evaluated their differences.

Combined Graphs:


0.9 Data Exploration

0.9.1 Part 1: Discussion of multivariate data, dimensional reduction via PCA, and starting our Step 3 script

  • Tidyverse packages and functions only work with data frames
  • The readr package always produces a data frame
  • tibble = data frame with nicer display and subsetting
  • piping = makes code more succinct and readable (%>%)
  • When it comes to sources of variance, not all experiments are created equal
  • As you increase the number of variables, you increase the potential sources of variance (the type of experiment you do will influence your sources of variance and the types of data you need to record)
  • examples of sources of variance: people conducting it, batch prep, variable confounded with variable of interest, sex (did we check for this?)
  • You need to record every variable in order to visualize the variance of our data
  • Dimensional reduction = reducing number of variables; PCA is a method of this
  • Principal Component Analysis (PCA) = find a linear combination of your variables that captures the largest spread (variance) in your cloud of data points (like a line of best fit); you then draw a line perpendicular to the first, which accounts for less of the variance; this reduces the number of dimensions in the dataset; PCA is an unsupervised method that has no information about sample labels
  • Reading in the studydesign.txt file, which tells us which samples are PIC and which are LF
  • We pull out the virus/healthy determinations and identify that these are the 2 classes with factor(group)
  • Hierarchical clustering only works on matrices; find the distance between points and cluster them (dist() produces a distance matrix showing the distance of every point to every other point); clustering produces a tree (dendrogram)
  • Right/left has no meaning in these kinds of trees; only vertical distance matters (it reflects the distance in the distance matrix)
  • It is really interesting that the LF and PIC samples are not clustering together respectively
  • Is LF08 really an outlier? - it may have the fewest reads aligning of all samples (there may be a technical rather than biological reason why it is different)
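A minimal sketch of the clustering and PCA calls, assuming the filtered, normalized log2 CPM matrix and sampleLabels from Step 2; the distance and agglomeration methods are illustrative choices, not necessarily the ones used here:

# samples need to be rows for dist(), hence the transpose
distance <- dist(t(log2.cpm.filtered.norm), method = "maximum")
clusters <- hclust(distance, method = "average")
plot(clusters, labels = sampleLabels) # draws the dendrogram shown below

# PCA on the same transposed matrix
pca.res <- prcomp(t(log2.cpm.filtered.norm), scale. = FALSE, retx = TRUE)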

Cluster Dendrogram of Raw Data

  • The results below show that the bulk of our variance is explained by PC1-PC4 (PC6 is essentially zero: after centering, six samples span at most five dimensions)
summary(pca.res) # Prints variance summary for all principal components.
Importance of components:
                          PC1     PC2     PC3     PC4     PC5       PC6
Standard deviation     12.834 12.4688 12.1273 11.0053 10.1853 6.316e-14
Proportion of Variance  0.238  0.2246  0.2125  0.1750  0.1499 0.000e+00
Cumulative Proportion   0.238  0.4626  0.6751  0.8501  1.0000 1.000e+00
  • rotation = how much each gene influenced each PC
  • x = how much each sample influenced each PC.

Variance by Each PC

  • The following results show the percent of variance explained by each PC; our variance is spread fairly evenly across the PCs
> pc.per
[1] 23.8 22.5 21.2 17.5 15.0  0.0

0.9.2 Part 2: Plotting PCA results and small multiples

  • The PCA scores have both magnitude and direction - they show how much each sample contributed to each PC

Original PCA Plot

  • You can use objects as axis labels
  • We now add colors to make our graph more informative

PCA Plot Colored

  • This graph does not really give meaning to PC1 (the samples do not separate by LF vs. PIC)
  • Additionally, the function warned there were too few data points to accurately cluster the samples

PCA Plot Labeled

  • ‘Small multiples’ plot = similar graphs using the same axes and labels; make it easy to spot interesting combinations of variables; shows magnitude and direction
  • pipe operator = %>% (a magrittr function, part of the tidyverse); streamlines our code by running multiple functions on successive results in one chain
  • A tibble allows us to view our data as a single numeric vector
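A minimal sketch of the small multiples plot, assuming pca.res, sampleLabels, and group from the earlier steps (plotting the first four PCs, per the variance summary above):

pca.res.df <- pca.res$x[, 1:4] %>%   # sample scores for the first four PCs
  as_tibble() %>%
  add_column(sample = sampleLabels, group = group)

pca.pivot <- pivot_longer(pca.res.df,
                          cols = PC1:PC4,
                          names_to = "PC",
                          values_to = "loadings")

ggplot(pca.pivot) +
  aes(x = sample, y = loadings, fill = group) +
  geom_bar(stat = "identity") +
  facet_wrap(~PC) +   # one small multiple per PC, same axes throughout
  coord_flip()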

PCA Small Multiples Plot

  • None of our small multiple plots of PCs tell us about healthy vs virus
  • PC2 and PC4 tell us about outliers
  • Take the time to thoroughly explore your data before downstream analysis
  • .fastq files have a lot of information in the headers

0.9.3 Part 3: Producing interactive tables and plots

  • dplyr = package used to mutate (add columns), filter (rows), select (columns), and arrange (sort rows) in data frames
  • gt package is great for making tables
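A minimal sketch of the four dplyr verbs feeding a gt table; mydata.df and the PIC.AVG/LF.AVG columns are hypothetical:

library(dplyr)
library(gt)

mydata.df %>%
  mutate(LogFC = PIC.AVG - LF.AVG) %>%  # add a log fold-change column (hypothetical columns)
  filter(abs(LogFC) > 1) %>%            # keep only rows passing a cutoff
  select(geneID, LogFC) %>%             # keep only these columns
  arrange(desc(LogFC)) %>%              # sort rows by LogFC
  gt()                                  # render as a formatted table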

Gene Expression in the Presence vs Absence of Poly I:C Formatted

  • use ggplot to make interactive (tooltip) plots

LogFC Table Formatted

  • These plots do not tell us anything about significance however
  • This graph shows that the same genes tend to be expressed in the presence and absence of poly I:C
  • It is possible that I did not choose genes involved in the immune response but are rather expressed all of the time

0.10 Accessing Public Data

0.10.1 Part 1: Accessing Public RNA-seq Data

  • Sequence Read Archive (SRA) = NCBI public sequencing database
  • You can directly download the .fastq files from there
  • There are SRA tools for the command line
  • Gene Expression Omnibus (GEO) = NCBI’s public repository for processed gene expression data; data generation there has outpaced analysis
  • Community-driven projects have helped to standardize sample collection, storage and data generation
  • Tools for rapid and large-scale reuse of public RNA-seq data have lagged behind
  • Hierarchical Data Format v5 (HDF5) = self-describing file format; uses groups and datasets instead of folders and files; perfect for storing genetic expression data; have different types of data in same file
  • ARCHS4 = “All RNA-seq and ChIP-seq Sample and Signature Search”; provides uniformly processed public RNA-seq samples at massive scale

0.10.2 Part 2: Step 4 Script to access the ARCHS4 Database

  • Had trouble downloading the human and mouse sample files
  • hdf5 file = has groups and datasets; contains metadata;
  • Samples from the same study are grouped together in the ARCHS4 Database
  • You can create your own studydesign file with ARCHS4 (with sample title, sample source, and sample characteristics)
  • Use it to determine the relationship between your PCs and the metadata
  • Will go back in later with results once sample files finish downloading

0.11 Checking Viability of Results

  • Since the plots produced above show little differentiation between the PIC and LF samples, we re-ran scripts 1, 2, and 3 to check that this is correct
  • We reran them because, based on our hypothesis, we would expect our control and experimental conditions to produce differing gene expression; however, this is not what we see
  • After making PCA plots for filtered & unnormalized data, unfiltered & normalized data, and unfiltered & unnormalized data (and with Rebecca’s checking), we found that there are indeed many similarities between the two conditions
  • As a result, I will carry out the following analyses with this in mind
  • I reran the scripts to ensure the data were filtered and normalized
  • I filtered for these specific immune genes (the condition below goes inside dplyr::filter()) & there were no results. This may mean there was no expression of these genes (or that they were removed during filtering).
geneID=="MMP1" | geneID=="GZMB" | geneID=="IL1B" | geneID=="GNLY" | geneID=="IFNG" | geneID=="CCL4" | geneID=="PRF1" | geneID=="APOBEC3A" | geneID=="UNC13A" 

0.12 Differential Gene Expression

0.12.1 Part 1: Introduction to Differential Feature Selection in R/Bioconductor

  • We have been preparing for DGE analysis all along
  • Kallisto quantifies so quickly that repeated bootstrapping is practical - bootstraps give insight into the technical uncertainty of RNAseq count estimates
  • DGE = differential gene expression
  • DTE = differential transcript expression
  • DTU = differential transcript usage (what percentage of output is due to specific isoforms - isoform switching)
  • Genes can be more accurately measured than transcripts (compare actual TPM to estimated TPM)
  • If you aggregate transcript-level data to gene-level data, you get more accurate counts of gene expression

0.12.2 Part 2: Starting Step 5 Script for DGE Analysis

  • design matrix = makes your experimental design explicit, since the choice of comparisons is not inherent in the studydesign file
  • model matrix = helps you set up how you are going to apply a linear model; assigns 1 or 0 to each sample according to whether it is PIC or LF
  • RNAseq data has an unequal variance/range across it (heteroskedasticity)
  • Voom = limma function that models the mean-variance relationship & will stabilize your data
  • We are fitting a linear model to our data to view the mean-variance trend (see the sketch below)
  • In the mean-variance trend, the more highly expressed genes have a lower standard deviation
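A minimal sketch of the limma workflow, assuming group (PIC vs. LF) from Step 1 and the filtered, normalized DGEList from Step 2; the contrast name "infection" matches the results summary shown later:

library(limma)
design <- model.matrix(~0 + group)       # one column per group, no intercept
colnames(design) <- levels(group)        # "PIC" and "LF"

v.DEGList.filtered.norm <- voom(myDGEList.filtered.norm, design, plot = TRUE) # mean-variance trend
fit <- lmFit(v.DEGList.filtered.norm, design)                                 # fit linear model per gene
contrast.matrix <- makeContrasts(infection = PIC - LF, levels = design)       # the comparison we care about
fits <- contrasts.fit(fit, contrast.matrix)
ebFit <- eBayes(fits)                    # empirical Bayes moderation of the gene-wise variances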

Bayesian Stats:

0.12.3 Part 3: Visualizing DGEs

  • Volcano plot = representation of the impact a treatment/condition (a pairwise comparison) has on our dataset

  • Our volcano plot does not look like the instructor’s - genes to the right of 0 are upregulated in the virus (PIC) condition, genes to the left are higher in the healthy (LF) condition
  • Genes with low p-values are very high up
  • We can set arbitrary cutoffs and visualize them

  • I adjusted the cutoffs to highlight the 7 genes that were upregulated by the presence of poly I:C

  • We can also pull out the DEGs - slightly odd to me that we only had 7 genes that were differentially expressed in the presence of poly I:C
  • Genes at 0 have the same level of expression between the 2 conditions
  • Genes upregulated in the poly I:C condition fall to the right of 0 - which is what we see here
  • We mostly care about genes that are strongly up-regulated (spread out further on the x-axis)
  • The vertical axis shows significance (-log10 of the adjusted p-value)
  • A y-axis cutoff of 2 (adjusted p = 0.01) was too stringent -> changed to 1.3 (adjusted p of about 0.05)
  • Not a lot of genes turning off - common for a viral immune response
> summary(results)
       infection
Down           0
NotSig     11745
Up             7
  • Had to review some code and ensure I changed disease to virus for studydesign purposes

Venn Diagram of DEGs:

Data Table of DEGs:

  • ISGs are interferon stimulated genes
  • Look up the genes and understand more about what they do, especially DISP2
  • ISG15 = “interferon-stimulated gene”; known role in antiviral immune response
  • IRF7 = “interferon regulatory factor 7”; transcriptional activation of virus-inducible cellular genes; master regulator of IFN-I immune responses
  • IFIT3 = “interferon-stimulated gene 49”; stimulates IFN production during RNA virus infections
  • IFIT1 = antiviral protein that recognizes 5′-triphosphate RNA
  • IFI6 = IFN-stimulated gene (ISG) whose expression is highly regulated by the stimulation of type I IFN-alpha that restricts various kinds of virus infections by targeting different stages of the viral life cycle
  • IFI27 = counteracts innate immune responses after viral infections by interfering with RIG-I signaling
  • DISP2 = one of two human homologs of a segment-polarity gene known as dispatched identified in Drosophila; product of this gene may be required for normal Hedgehog (Hh) signaling during embryonic pattern formation
    • The Hedgehog pathway is a signaling pathway that transmits information to embryonic cells required for proper cell differentiation
  • Overall, the presence of poly I:C upregulated antiviral immune-related genes

0.13 Module Identification

0.13.1 Part 1: Intro to Clustering and Starting the Step 6 Script

  • Clustering = group genes or transcripts together based on their behavior across the samples we measure the expression in; results in modules of expressed genes
  • We need to separate out genes that are behaving in very different ways
  • We are creating a character vector of colors (RColorBrewer)
  • Having trouble creating diffgenes - need to re-check DEGList
  • Changed dim(diffgenes) to dim(diffgenes.df)
  • Replace “Data” section with (just use diffGenes from Step5 script):
results <- decideTests(ebFit, method="global", adjust.method="BH", p.value=0.01, lfc=2)
colnames(v.DEGList.filtered.norm$E) <- sampleLabels
dim(diffGenes)

0.13.2 Part 2: Making and Interpreting Heatmaps

  • Making a heatmap out of a data matrix
  • Make a distance matrix based on correlations of our genes
  • First, cluster genes (rows) in each set of differentially expressed genes (need to find pairwise correlations of our genes)
  • Secondly, cluster our samples (columns)
  • Colorblindness matters when creating heatmaps, be aware
  • scale = ‘row’ draws the heatmap according to row z-scores; puts emphasis on comparing between samples
  • scale = ‘none’ draws the heatmap according to raw values; puts emphasis on comparing between genes (will skew the image so the most highly expressed genes dominate)
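A minimal sketch of the heatmap call, assuming the diffGenes matrix from Step 5 and a color vector myheatcolors (e.g., from RColorBrewer); the correlation and linkage methods are illustrative choices:

library(gplots)
# cluster genes (rows) and samples (columns) on correlation-based distances
clustRows <- hclust(as.dist(1 - cor(t(diffGenes), method = "pearson")), method = "complete")
clustColumns <- hclust(as.dist(1 - cor(diffGenes, method = "spearman")), method = "complete")

heatmap.2(diffGenes,
          Rowv = as.dendrogram(clustRows),
          Colv = as.dendrogram(clustColumns),
          col = myheatcolors,
          scale = "row",   # z-score each gene; emphasizes comparison between samples
          trace = "none", density.info = "none")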

Cluster Heatmap:

Cluster Interactive Heatmap:

  • We now want to pull out modules of co-expressed genes (genes that are upregulated or downregulated)
  • Changed code below to 1:7 instead of 1:2307
module.assign.pivot <- pivot_longer(module.assign.df, # dataframe to be pivoted
                          cols = 1:7, # column names to be stored as a SINGLE variable
                          names_to = "geneID", # name of that new variable (column)
                          values_to = "module") # name of new variable (column) storing all the values (data)

Heatmap for Each Subcluster:

0.13.3 Part 3: Alternative Clustering Methods

  • Clustering:
    • Unsupervised = blind to groups (hierarchical, k-means, etc.); assigns all genes to clusters (data partitioning); each gene can belong to only one module (binary)
    • Supervised = group aware (random forest, neural nets, etc.)
  • By adding a gene that does not fit into a module, we are creating false positives
  • Modules can differ dramatically based on clustering method used
  • Clust generates “tight” modules; command-line tool

0.14 Functional Enrichment Analysis

0.14.1 Part 1: Introduction to Enrichment Analysis Methods

  • Functional Enrichment Analysis = go from a collection of genes that have patterns to them, to understanding what the genes do; “pathway analysis”
  • 2 main approaches:
    • Gene Ontology (GO) = structured vocabulary for describing gene function; requires arbitrary selection of genes; does not require any data (gene identifiers only); species specific; user doesn’t control vocabulary; manually curated; answers where a gene product is localized, what functions it performs in that location, and how it carries out that function; redundant and often requires summarization of results
    • Gene Set Enrichment Analysis (GSEA) = no need to pre-select genes; requires both identifiers and data; not species specific; user can control the vocabulary; like counting cards: rank-order the genes by magnitude of change (up/downregulation), then walk down the ranked list scoring each gene set; 2 types: self-contained (are any of the genes in my set differentially expressed?) and competitive (are genes in my set more commonly differentially expressed than genes not in the set?)
  • By clustering the genes into modules, we improve our ability to detect functionally enriched themes in these genes
  • Think about why you set out to do an RNASeq analysis in the first place (regulation of gene expression under our conditions)
  • MSigDB = curated signature database
  • Permutation = must be conducted for GSEA to determine if an enrichment score is actually statistically significant; permuting samples is preferred but large n required; permuting genes is common but can inflate type I error due to inter-gene correlations

0.14.2 Part 2:

  • Running gost carries out a functional enrichment analysis, comparing all of the genes in my set to all of the genes annotated in humans to see whether our genes are enriched more than would be expected by chance (see the sketch below)
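A minimal sketch using gprofiler2, assuming myTopHits is a character vector of gene symbols from the DGE analysis:

library(gprofiler2)
# run the enrichment query against human annotations
gost.res <- gost(myTopHits, organism = "hsapiens", correction_method = "fdr")
# Manhattan plot of the enriched terms
gostplot(gost.res, interactive = TRUE, capped = TRUE)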

Manhattan Plot of Functional Enrichment Analysis Using GO:

  • The circles that are the farthest up are the most enriched
  • TF:M11685_1 is most enriched (tallest blue dot); the rest are mostly interferon signaling

  • The upregulated module contains:
> myModule
         PIC01    PIC02    PIC03     LF07      LF08      LF09
IFI27 4.185237 3.912628 3.370907 2.009563 1.9274367 2.0385980
IFI6  3.814688 3.502734 2.845341 1.037494 0.9473482 0.7926239
IFIT1 4.584897 4.468331 3.980659 2.404900 2.6235918 2.2917992

  • The downregulated module contains:
> myModule
         PIC01    PIC02    PIC03       LF07       LF08         LF09
DISP2 2.788966 2.914997 2.739440  0.6086220  1.2242679  1.254558268
IFIT3 3.664853 3.334991 3.145716 -0.7127142  0.8762171 -0.004597556
IRF7  2.869427 2.969015 2.898255  1.6208349  1.7143503  1.472258952
ISG15 1.677950 1.701252 1.534261 -0.4579730 -0.4314073 -0.388628741

0.14.3 Part 3:

0.14.4 Part 4:


0.15 Running Scripts with Unstimulated and poly I:C data

  • Created new studydesign_unst.txt, new readMapping_UNST.sh file
  • Move into Bioinformatics_Research directory and run the following lines in terminal
conda activate rnaseq
cd raw_data
./readMapping_UNST.sh
  • The code is taking too long to run; need to run it inside a screen session (GNU screen) so it can continue in the background

0.16 Running GO and GSEA Functional Enrichment Analysis on External Sites

  • Struggling with the GSEA app and getting the input data into the correct format

0.17 Making Presentation Ready Figures

  • Edited violin plots
p3 <- ggplot(log2.cpm.filtered.norm.df.pivot) +
  aes(x=samples, y=expression, fill=samples) +
  geom_violin(trim = FALSE, show.legend = FALSE) +
  stat_summary(fun = "median", 
               geom = "point", 
               shape = 95, 
               size = 10, 
               show.legend = FALSE) +
  labs(y="Log2 Expression in Counts Per Million", x = "Samples",
       title="Log2 Counts per Million (CPM) Across Samples",
       subtitle="filtered, TMM normalized",
       caption=paste0("produced on ", Sys.time())) +
  theme_minimal()

myColors <- c("#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00")

p3 + scale_fill_manual(values = myColors)

plot_grid(p1 + scale_fill_manual(values = myColors),
          p2 + scale_fill_manual(values = myColors),
          p3 + scale_fill_manual(values = myColors),
          labels = c('A', 'B', 'C'), label_size = 12)

  • Edited graphs from PC analysis
  • Wanted to change labels but having trouble with ggplot command

  • Adjust Volcano Plot
ggplot(myTopHits.df) +
  aes(y=-log10(adj.P.Val), x=logFC, text = paste("Symbol:", geneID)) +
  geom_point(size=3.5,colour="#FF696D") +
  geom_hline(yintercept = 1.3, linetype="longdash", colour="black", size=1) +
  geom_vline(xintercept = 1, linetype="longdash", colour="black", size=1) +
  geom_vline(xintercept = -1, linetype="longdash", colour="black", size=1) +
  annotate("rect", xmin = 1, xmax = 4, ymin = 1.3, ymax = 1.8, alpha=.2, fill="grey") +
  #annotate("rect", xmin = -1, xmax = -12, ymin = -log10(0.01), ymax = 7.5, alpha=.2, fill="#2C467A") +
  labs(title="Volcano Plot of Gene Regulation",
       caption=paste0("produced on ", Sys.time())) +
  theme_minimal()


0.18 Creating Heatmap of Differentially Expressed Genes

  • Having trouble determining x and y for the heatmap - fixed below by pivoting the tibble to long form, so geom_tile gets x (sample), y (geneID), and fill (expression); also swapped scale_fill_viridis for ggplot2’s built-in scale_fill_viridis_c
> diffGenes.df
# A tibble: 7 × 7
  geneID PIC01 PIC02 PIC03   LF07   LF08     LF09
  <chr>  <dbl> <dbl> <dbl>  <dbl>  <dbl>    <dbl>
1 DISP2   2.79  2.91  2.74  0.609  1.22   1.25   
2 IFI27   4.19  3.91  3.37  2.01   1.93   2.04   
3 IFI6    3.81  3.50  2.85  1.04   0.947  0.793  
4 IFIT1   4.58  4.47  3.98  2.40   2.62   2.29   
5 IFIT3   3.66  3.33  3.15 -0.713  0.876 -0.00460
6 IRF7    2.87  2.97  2.90  1.62   1.71   1.47   
7 ISG15   1.68  1.70  1.53 -0.458 -0.431 -0.389

diffGenes.pivot <- pivot_longer(diffGenes.df, cols = -geneID, names_to = "sample", values_to = "expression") # geom_tile needs long data
gg <- ggplot(diffGenes.pivot, aes(x = sample, y = geneID, fill = expression, text = paste("Symbol:", geneID)))
gg <- gg + geom_tile(color = 'White', size = 0.1) + scale_fill_viridis_c(name = "Log2 expression") + coord_equal() + ggtitle("Differentially Expressed Genes Across Samples")
gg