R packages used
A variety of R packages was used for this analysis. All graphics and
data wrangling were handled using the tidyverse suite of packages. All
packages used are available from the Comprehensive R Archive Network
(CRAN), Bioconductor.org, or Github.
Course Overview
Goals:
- Learn to analyze your own bulk and single cell RNA-seq data
- Develop a lightweight and reusable RNA-seq pipeline
- Best practices for working in R/bioconductor
- Become a better data scientist
- Learn basic building blocks to create publication-quality
graphics
- Learn to report your work in a transparent and reproducible way
Course Path:
- Data import
- Data wrangling
- Data exploration
- Accessing public data
- Feature selection
- Finding modules
- Functional enrichment analysis
- Dynamic reports
Intro to RNAseq
technology and data
Part 1: Illumina’s
SBS
Shankar Balasubramanian and David Klenerman were scientists who
were interested in the use of fluorescently tagged nucleotides in RNA
sequencing. This led to illumina sequencing (‘sequencing by
synthesis’ technology).
There were many different sequencing techniques around 2012,
however, many became obsolete.
The MiniSeq, MiSeq, NextSeq, and NoveSeq all perform RNA
sequencing, however, each are capable of different runtimes and numbers
of reads. The wide array of sequencing technology options has led to the
democratization across organizations and institutions.
There has been a shift in time and money associated with
sequencing. Sequencing does not require much time anymore, so you are
able to make much larger and more complex experiments.
The major technological hurdle associated with illumina
sequencing was the inability to conduct single molecule sequencing at
the time.
‘Sequencing by Synthesis’ (SBS) is the process by which
a single molecule is taken, folded over to hybridize the primer, and
syntehsize a second reverse strand.
PCR colony (‘polony’) size and distance can be controlled easily
with SBS.
Clustering density is how close or far colonies are from
each other.
*** Overclustering is when there are colonies that are too
close to each other, and there are too many to collect. It is hard to
identify the data from a crowded slide.
*** Underclustering is when there are too few colonies for
collection. It is great data, but very sparse.
How the sequencing happens (Illumina
Sequencing Source): Once a polony is of an appropriate size, the
data collection begins.
- The DNA is broken up into small fragments and adaptors
(short DNA sequences) attach to the fragments.
- Through incubating the fragments with sodium hydroxide, the
double-stranded DNA fragments become single stranded.
- The DNA fragments are then washed across the flowcell
(sample cells), complementary DNA molecules stick to the flow cell, and
rest is washed away.
- The flow cell DNA is replicated to form clusters of DNA with the
same sequence. Each cluster emits a signal that can be picked up by a
camera when sequenced.
- Extra nucleotide bases are added which creates double-stranded
bridges of DNA between the primers and the flowcell base.
- Using heat, the DNA is made single-stranded resulting in multiple
clusters of identical single-stranded DNA sequences.
- Fluorescently tagged nucleotides are added to the flowcell and they
compete for addition to the growing chains. Primers are also added at
this time.
- The primer binds to the DNA via DNA polymerase, and adds one
fluorescently tagged nucleotides to the strand.
- Lasers activate the fluorescence of the flow cell, and each
nucleotide base emits a different color.
- The fluorescent nucleotide is removed and another is added.
Repeat.
Part 2: Study
Design
- You should always have an independent evaluation that your
experiment worked. We do not want the RNAseq data itself to be proof of
experimental success. We need to have validation that our experiment is
doing what we think it should be doing.
- We need to have controls for our study.
- What insight is to be gained from each treatment or condition? We
want to know what our “prize” analysis is, but also some secondary
comparisons that maybe are not as interesting. These secondary and
tertiary conditions help show that the experiment worked.
- What signal and noise are we expecting (signal-to-noise ratio)?
There could be contamination in the cells (noise) that could impact our
ability to detect true changes (signal).
- How many replicates do we need vs. how deep should I sequence? We
should compare the two, but we typically see that more replicates is
cheaper and often better than deeper sequencing. Shallow sequencing can
still reveal important biological patterns. If you do more than 12
replicates, you will have a good ability to detect truly expressed
genes. ‘Standard’ replicate amounts range from 2 to 3. The difference in
results of varying read depth are extremely small, but deeper reads are
better for alternative splicing.
- What is the cost? Library preparation and sequencing are the two
main factors.
- We are measuring gene expression in nucleated red blood cell
progenitors in the presence of poly I:C (a double stranded RNA that
mimics viral infection). Poly I:C (PIC) is not able to travel through
the membrane on its own, so we use Lipofectamine (LF) (a lipid carrier
that fuses with the plasma membrane) to carry it through to inside the
cell to trigger an immune signaling pathway.
- We have three replicates for when poly I:C and LF were present, and
when just LF was present (vehicle control).
- We conducted paired-end sequencing (R1, R2) meaning that it read
from the 5’ to 3’ end, and from the 3’ to 5’ end to increase our
confidence in the reads.
Setting up your
software environment
Part 2: Installing
software
- Download most current version of R
- RStudio provides is with an Integrated Development
Environment (IDE); it has a text-editor, workspace,
library/plots/file-browser/help, and terminal built-in
- R packages have multiple uses with a focus on stats, graphics, and
modeling
- You will always want to do any updates that it provides
- We will install Kallisto and other tools using Conda
- Conda is an open-source environment management system that is
critical for management of dependencies
- We need to configure our Conda channels as well
- You need to create your first Conda environment to download
Callisto
- Conda environments are siloed
- Downloaded Bioconductor, Conda, and all related dependencies
Ultra-fast Map
Reading with Kallisto
- Read mapping is the process of aligning the reads with a
reference genome
Part 1: FastQC
& FastP
- Try not to pollute the base environment and download packages in
separate environments
- Activate the environment with conda activate, then call the program
you want to use
- Do not unzip/decompress the files (too much space than your local
computer can hold)
- Read quality as our number of cycles increases
- Use fastqc to perform a quality assessment on our data
(fastqc *.gz -t 8) or fastp (fastp -i sample1.gz -t 8)
- HTML fastqc file can be opened in a web browser to view all of the
fastqc data
- Fastqc will give you plots and statistics; it may fail to
report
- We want duplicate sequences with rnaseq
- You do not have to run fastp (I had some issues with it)
- qc = quality control of our experimental mRNA data
Part 2: Kallisto
& MultiQC
- Downloaded DNA fasta file from [https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/dna/]
(Homo_sapiens.GRCh38.cdna.all.fa.gz)
- Drop fasta file in sublime to view the data
- It contains multiple entries with the gene id, gene symbol,
transcription, and the gene itself
- We need to build an index (you only need to do it once)
- Be careful with file names! (& do not use spaces)
- Ran:
kallisto index -i Homo_sapiens.GRCh38.cdna.all.index Homo_sapiens.GRCh38.cdna.all.fa
- Found that it has 1037056 contigs (contiguous reads) and
contains 116726086 k-mers
- Creates a large index file as a result
- Mapping single-end data with:
kallisto quant \
-i Homo_sapiens.GRCh38.cdna.all.index \
-o test \
-t 8 \
--single -l 250 -s 30 \
fastq file name \
&> test.log
- Multi-QC demonstration: generates output and HTML files
- Multiqc summarizes outputs
multiqc -d .
- Mapping paired-end data with:
kallisto quant \
-i Homo_sapiens.GRCh38.cdna.all.index \
-o test \
-t 8 \
--single -l 250 -s 30 \
read 1 fastq file name \
read 2 fastq file name \
&> test.log
Part 3: Methods for
Quantifying Gene Expression
- Kallisto is for alignment (matches mRNA sequences with human DNA
genome)
- We want to automate many fastq files (index construction &
mapping)
- List all of the commands in a .sh file (shell script)
- Go to the command line and write all of the lines of the shell
script
- Download readMapping.sh file and drop it on Sublime; it gives you a
pipeline
- Read mapping is text pattern matching
- We have a tradeoff with building a index: computational complexity
(STAR)
- Burrows-Wheeler Transform (BWT) is used to sort characters
alphabetically in a file; it is the last column
- Sort rows lexicographically based on first column
- Suffix array allow you to search lexicographically
(align)
- Take a baseline sequence and find matches (less computationally
intensive), rather than a single letter at a time
- RSubread aligner = vote to see where alignment is most
likely; make alignment process much less computationally extensive
- Sailfish = exact alignment of short k-mers; not approximate
matching and extension part; shred the reads into k-mers; involves few
parameters
- Idea of alignment was abandoned for Kallisto:
- Create transcriptome de Brujin graph (T-BDG) using k-mers (31
nucleotides)
- Move through reference each time and ask about compatibility with
our transcripts
- There will be splits to show you compatible reads
- Instead of aligning, we find transcripts that are compatible with
the read
- Don’t even need to search for all possible k-mers
- You can skip between skip nodes and non-skip nodes
- For most reads, only two k-mers were hashed because all k-mers from
the entire read mapped to a continguous region in the T-DBG
- It is really fast; they can generate bootstraps with -b
which can help you model technical variance in dataset (we are not going
to do these bootstraps)
- Used the following code to run readMapping.sh (all of fastqc
files)
- It resulted in 1037056 contigs and 116726086 k-mers
chmod 777 readMapping.sh
./readMapping.sh
- Getting the following error for HTS0101, HTS0302, HTS0701, HTS0801,
and HTS0901.
Failed to process file HTS048_09_3C_LF_S9_L001_R1_001.fastq
uk.ac.babraham.FastQC.Sequence.SequenceFormatException: Ran out of data in the middle of a fastq entry. Your file is probably truncated
at uk.ac.babraham.FastQC.Sequence.FastQFile.readNext(FastQFile.java:187)
at uk.ac.babraham.FastQC.Sequence.FastQFile.next(FastQFile.java:129)
at uk.ac.babraham.FastQC.Analysis.AnalysisRunner.run(AnalysisRunner.java:77)
at java.base/java.lang.Thread.run(Thread.java:1623)
- Originally ran with single-end read mapping, but should have used
paired-end read mapping to demonstrate reading from 3’ to 5’ and 5’ to
3’ for completeness
- With single-end read mapping, reads get sloppy at the beginning and
end (paired-end avoids this)
- This is why quality control failed at per base sequence content
- To rerun with paired-end read mapping, I edited readMapping.sh, ran
conda activate, navigated to raw Data folder, and ran
./readMapping.sh.
Understanding RNAseq
Count Data
Part 1:
Preamble
- Downloaded step 1 script
- We have used command line tools to assess the quality of the raw
reads (quality control)
- Used Kallisto to take a fasta file and turn it into an indexed file
(constructing an index - makes it easier and faster to search
again)
- Used Kallisto to map our raw reads against the transcriptome to get
summarized read counts
- I do not have a tsv file from Kallisto - going to rerun to see if I
get one
Part 2: Measuring
Digital Gene Expression
- Effective length of transcript is taking the actual length
- length of the fragment + 1 (length of a transcript after adjusting for
the total number of possible positions a fragment of size X could
originate from)
- Transcript is from the transcriptome & fragment is one of our
reads
- RNAseq gives a relative measure of gene expression
- Techniques used to measure mRNA abundance and reads for RNAseq data
are used to estimate a statistic that is as closely proportional to the
relative molar concentration as possible
- Understanding the units for measurement for RNAseq is critical to
understanding how we determine differential expression
- Normalization = any number of ways that a dataset is
globally altered to improve our ability to detect differently expressed
genes
- What happens to our data when we tweak it in different ways?
- The gene that is the lowest expressed has a low SD & vice
versa
- This is problem because it makes it harder to pull these highly
expressed genes out (heterscedasticity)
- Convert raw counts with a log 2 conversion, to make the most lowly
expressed gene have the highest SD & vice versa
- With large datasets, fixing one problem globally creates another
(check your assumptions and fit them to your data)
- It is better to represent RNAseq data in a log 2 scale
- The expected reads per sample would be even for all samples
(the world is not this way, will be unbalanced) - between samples;
adjusts for the length of feature
- Scaling units for within-sample comparisons - different
regions of may be different sizes; adjusts for depth of sequencing of
sample
- reads per kilobase, per million reads (RPKM) = between
sample + within sample; unit of expression for RNAseq data; (# of reads
mapped to a genomic region) / [(region length in kb)*(total # reads)] *
10^6
- RPKM is no longer the accepted unit; no sufficient for
normalization
- The problem with RPKM is that the average relative molar
concentration for each and every sample of RNAseq data mapped to the
same genome is the same value - averages will be very very different
between samples
- Fixing RPKM is easy - scale by gene length first and create the
transcript per million (TPM) = (reads per Kn) / (total RPK in
sample); now the TPMs all total to 1
- Further normalization steps need to be taken before one can compare
between samples
- Library size scaling is too simple for many biological
applications
- Scaling will all be messed up because of the number of original
genes matching to each one
- Most current normalization methods attempt to find a set of genes
with minimal variance across samples for normalization
- Statistical tools make inherent assumptions about your data
- Most genes are not differentially expressed in your dataset
(implications for comparing very different treatments/conditions)
- Approximating equivalent numbers of up and down regulated genes
Part 3: Starting
Our R Project and Step 1 Script
- When you start a new R project, it starts a new R Session, the
.Rprofile in the project’s main directory, access commands in .Rhistory
file
- R script != to RStudio prject (can have many R scripts associated
with it)
- Edit -> Folding -> Collapse All to break down script into
code chunks (foldable; add “—” after title)
- Edit -> Folding -> Expand All to undo it
- RStudio is text-aware
- Click install on right-hand pane under “Packages”
- Run lines of code by highlighting them
setRepositories() # installs all R packages
1 2 3 4 5 6 7 8
Starting Your R
Workflow
Part 1: Starting
Step 1 Script
- Tidyverse is an environment of R packages
?read_tsv gives you manual page (help
documentation)
- If you highlight an object and click Control+Enter, it will show you
its contents
setRepositories() before installing R packages
- Automate file paths
- Downloaded studydesign.txt from data section and updated it with
appropriate folder names
- My
all(file.exists(path)) only produces one true - but
it ends up working
Part 2: Tapping
Into Annotation Databases and Reading Kallisto Data into R
- Want to map transcription level expression data to gene level
expression data (need to have one-to-one mapping of transcript ids to
gene ids) - this is because a transcript is not helpful unless we know
which gene it is from
- BiomaRt allows users to connect to a database of many different
annotations
- You can pull out all of the gene level annotations for a dog
(ex.)
- A tibble is a tidied-up data frame
- With non-model organisms, there are often fewer gene symbols than
transcript identifiers available
- Toggling txOut to TRUE will allow you to change the ordering of
genes to transcripts
- We used the tximport package to read our Kallisto outputs into the R
environment
- We used EnsemblDB to attach gene level information to our
transcription data (transfers our data to gene level)
Wrangling Gene
Expression Data
Part 1: Starting
Step 2 Script
- The Step 2 Script will be used filter our data and get normalized
read counts
- R Data Types:
- vector = series of objects
- matrix = has rows & columns; will have the same kind of
data
- array = several matrices stacked together
- dataframe = matrix but can have mix of types
- list = any combination of the above data types
- Our transcript list will be far bigger than our gene list
- The # of elements is the number of data points in the matrix
- There are a lot of transcripts with no evidence of gene expression
data
- “Counts” are the actual number of reads matching to that gene in
that sample (it is an element in our list)
- “Abundance” refers to the number of reads present
- Sum of our columns in myTPM should be close to 1 million (shows how
many of those transcripts are gene maps) - transcripts per million
refers to transcripts not genes
colSums(myTPM)
[1] 912920.2 910074.9 911529.2 909853.9 909855.2 909729.7
- Sum of our columns in gene level data does not sum to 1 million
(myTPM_Counts)
colSums(myTPM_Counts)
[1] 48423530 59355199 43597530 52757379 53130928 52197283
- Use
$ to get a column of a data frame
Part 2: Walking
Through the Step 2 Script, & Relating Our Work to the ‘Grammar of
Graphics’ and ‘Tidy’ Data
- It is extremely easy to manipulate a data matrix (calculate standard
deviation, average, and median)
- ggplot() creates a graph using layers (you can stack these - styles)
- give it the data you want it to graph and give it the aesthetics
(‘grammar graphics’)
myTPM.stats <- transform(myTPM,
SD=rowSds(myTPM),
AVG=rowMeans(myTPM),
MED=rowMedians(myTPM))
ggplot(myTPM.stats) +
aes(x = SD, y = MED) +
geom_point(shape=25, size=3)
- There is heteroskedasticty across our data (unequal variance across
median values) - representing our data on a log2 scale helps with this
(impacts of count data)
- ggplot2 shape parameter = menu for type of shape for data
points
Summary Stats (Median v. SD) For Each Transcript or
Gene:
- Use geom_hex instead of geom_point to represent
data as a hexagonal bin (heatmap) - shows relative expression of a
certain gene
ggplot(myTPM.stats) +
aes(x = SD, y = MED) +
geom_hex(shape=25, size=3)
- You can specialize themes of your ggplot with various theme
packages
- ggplot are plots on plots (layered)
- Benefits of ggplots:
- Users can iteratively update a plot, changing specific elements as
needed
- Even high-level aspects of a plot can be changed
- Layers allow plots to be overplayed
- Layers also make it easy to use # to toggle layers on/off
- Default arguments for each layer minimize the need for users to
dictate details
- We make a DGElist (data structure) with our counts - has two
elements, 1. counts and matrix of sample labels, 2. samples with group
library size and normalization factors (good to know how large our files
are)
- We utilize counts per million instead of TPM because it
shows how many times we hit this particular gene, rather than reading a
lot of short fragments that don’t really align (noise)
Counts Per Million on Log2 Scale:

- We will coerce our data matrix (respects row names) to be a data
frame (does not respect row names - we need to save these as a column
gene idea)
- When we wrangle our data, we will often lose our sample labels, so
we will need to add them back in
- We will now work to filter our data to reduce any
noise
- Tidy data philosophy = “A dataset is messy or tidy (long)
depending on how rows, columns and tables are matched up with
observations, variables, and types.”
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
- Gene expression data is always messy (wide) - we need to take every
variable and make it form a column
- Tidy data is ideal for graphing - values are easily retrieved as a
single vector
- We can plot the statistical summary of our data - plot every sample
on the x-axis and the log2 on the y-axis
Filtered Data:

- The distribution of the data for each sample, and we were able to
bin the data by sample
- It would be unusual if there was a lot of variation among
samples
- You can flip the coordinates on the sample by adding
coord_flip()
- We are trying to make a logical matrix of all of the genes
- 12229 / 35371 gene samples have no counts at all (great!) - you can
summarize this with the table command
- We now set the cutoff for the number of samples in the smallest
group of comparison & filter it
- We are using subsetting to tell it which rows or columns we
want to remove or keep
- The following result means that we removed 11752 rows of our data
that did not have more than one count per million (cpm) in at least 3 of
the samples
keepers <- rowSums(cpm>1)>=3
myDGEList.filtered <- myDGEList[keepers,]
dim(myDGEList.filtered)
[1] 11752 6
- We now normalize our data to make them have consistent
units
- The normalization factors are now not all 1 because it is supposed
to control for differences in distributions (tweaked up and down)
- We need to normalize our data by applying a normalization method
“TMM”
- Our normalization of the data does not look very different from the
non-normalized data meaning that we have lots of data and reads
log2.cpm.filtered.norm <- cpm(myDGEList.filtered.norm, log=TRUE)
log2.cpm.filtered.norm.df <- as_tibble(log2.cpm.filtered.norm, rownames = "geneID")
colnames(log2.cpm.filtered.norm.df) <- c("geneID", sampleLabels)
Normalized Data:

- Today, we took plotted the raw, filtered, and normalized data and
evaluated their differences.
Combined Graphs:

Data Exploration
Part 1: Discussion
of multivariate data, dimensional reduction via PCA, and starting our
Step 3 script
- Tidyverse packages and functions only work with data frames
- The readr package always produces a data frame
- tibble = data frame with nicer display and subsetting
- piping = makes code more succinct and readable
(%>%)
- When it comes to sources of variance, not all experiments are
created equal
- As you increase the number of variables, you increase the variance
(the type of experiment you do will influence your sources of variance
and the types of data you need to record)
- examples of sources of variance: people conducting it, batch prep,
variable confounded with variable of interest, sex (did we check for
this?)
- You need to record every variable in order to visualize the variance
of our data
- Dimensional reduction = reducing number of variables; PCA
is a method of this
- Principle Component Analysis (PCA) = find a linear
combination of your variables that captures the largest spread
(variance) in your cloud of data points (line of best fit); reduces
number of dimensions in data set; you then draw a line perpendicular to
that one (will account for less of variance than the first line);
unsupervised learning method which has no information about sample
labels
- Running studydesign.txt file which tells us what PIC and LF are
- We pull out the virus/healthy determinations and identify that these
are the 2 classes with
factor(group)
- Hierarchical clustering only works on matrices; find the
distance between points and cluster them (produces a diagonal matrix,
shows distance of every point to another); produces a tree
(dendogram)
- right/left has no meaning in these kinds of trees, only vertical
distance matters (reflects distance in distance matrix)
- It is really interesting that the LF and PIC samples are not
clustering together respectively
- Is LF08 really an outlier? - maybe has the fewest of all reads
aligning (may be technical reason why it is different and not
biological)
Cluster Dendogram of Raw Data

- The results below show that bulk of our data is explained by PC1, 2,
3, and 4
summary(pca.res) # Prints variance summary for all principal components.
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6
Standard deviation 12.834 12.4688 12.1273 11.0053 10.1853 6.316e-14
Proportion of Variance 0.238 0.2246 0.2125 0.1750 0.1499 0.000e+00
Cumulative Proportion 0.238 0.4626 0.6751 0.8501 1.0000 1.000e+00
- rotation = how much each gene influenced each PC
- x = how much each sample influenced each PC.
Variance by Each PC

- The following results show the variance per PC, this means that our
data is pretty spread out
> pc.per
[1] 23.8 22.5 21.2 17.5 15.0 0.0
Part 2: Plotting
PCA results and small multiples
- The data has both magnitude and direction - shows you how much each
sample contributed to each PC
Original PCA Plot

- You can use objects as axis labels
- We now add colors to make our graph more informative
PCA Plot Colored

- This graph does not really give meaning to PC1 (does not graph by LF
or PIC)
- Additionally, the function stated there were two few data points in
order to accurately cluster the samples
PCA Plot Labeled

- ‘Small multiples’ plot = similar graphs using the same axes
and labels; make it easy to spot interesting combinations of variables;
shows magnitude and direction
- pipe function =
%>% (Magrittr function as
part of tidyverse); streamlines our code; runs multiple functions on
successive code in one line
- A tibble allows us to view our data as a single numeric vector
PCA Small Multiples Plot

- None of our small multiple plots of PCs tell us about healthy vs
virus
- PC2 and PC4 tells us about outliers
- Take the time to thoroughly explore your data before downstream
analysis
- .fastq files have a lot of information in the headers
Part 3: Producing
interactive tables and plots
- dplyr = package used to mutate (columns), filter (rows),
select (columns), and arrange (rows) things in data frames
- gt package is great for making tables
Gene Expression in the Presence vs Absence of Poly I:C
Formatted

- use ggplot to make interactive (tooltip) plots
LogFC Table Formatted

- These plots do not tell us anything about significance however
- This graph shows that the same genes tend to be expressed in the
presence and absence of poly I:C
- It is possible that I did not choose genes involved in the immune
response but are rather expressed all of the time
Accessing Public
Data
Part 1: Accessing
Public RNA-seq Data
- Sequence Read Archive (SRA) = NCBI public sequencing
database
- You can directly download the .fastq files from there
- There are SRA tools for the command line
- Gene Expression Omnibus (GEO) = data generation outpaces
analysis
- Community-driven projects have helped to standardize sample
collection, storage and data generation
- Tools for rapid and large-scale of public RNA-seq data have lagged
behind
- Hierarchical Data Format v5 (HDF5) = self-describing file
format; uses groups and datasets instead of folders and files; perfect
for storing genetic expression data; have different types of data in
same file
- ARCHS4 = “All RNA-seq and ChIP-seq sample sequence search”;
scales resources to massive alignment
Part 2: Step 4
Script to access the ARCHS4 Database
- Had trouble downloading the human and mouse sample files
- hdf5 file = has groups and datasets; contains
metadata;
- Samples from the same study are grouped together in the ARCHS4
Database
- You can create your own studydesign file with ARCHS4 (with sample
title, sample source, and sample characteristics)
- Use it to determine the relationship between your PCs and the
metadata
- Will go back in later with results once sample files finish
downloading
Checking Viability
of Results
- Since the photos produced above show that there is little
differentiation between the PIC and LF samples, we re-ran scripts 1, 2,
and 3 above to ensure that this is correct
- We reran them because based on our hypothesis we would expect our
control and experimental conditions to produce varying results in gene
expression, however, this is not what we see
- Following making PCA plots for filtered & unnormalized data,
unfiltered & normalized data, and unfiltered & unnormalized data
(and with Rebecca’s checking), we found that there are indeed many
similarities between the two conditions
- As a result, I will continue to carry out the following analyses
with this in mind
- I reran the scripts to ensure it was filtered and normalized
- I filtered for these specific immune genes & there were no
results. This may mean that there was no expression of these genes.
geneID=="MMP1" | geneID=="GZMB" | geneID=="IL1B" | geneID=="GNLY" | geneID=="IFNG" | geneID=="CCL4" | geneID=="PRF1" | geneID=="APOBEC3A" | geneID=="UNC13A"
Differential Gene
Expression
Part 1:
Introduction to Differential Feature Selection in R/Bioconductor
- We have been preparing for DGE analysis all along
- Kallisto can do sequencing so quickly through bootstrapping -
repeated bootstrapping can give insight into uncertainty for counts of
RNAseq data
- DGE = differential gene expression
- DTE = differential transcript expression
- DTU = differential transcript usage (what percentage of
output is due to specific isoforms - isoform switching)
- Genes can be more accurately measured than transcripts (compare
actual TPM to estimated TPM)
- If you aggregate transcript-level data to gene-level data, you get
more accurate counts of gene expression
Part 2: Starting
Step 5 Script for DGE Analysis
- design matrix = choice of comparisons is not inherent from
studydesign file; explicit about what experimental design is
- model matrix = helps you set up how you are going to apply
a linear model; assign 1 or 0 according to PIC or LF
- RNAseq data has an unequal variance/range across it
(heteroskedasticity)
- Voom = package that models mean-variance relationship &
will stabilize your data
- We are fitting a linear model to our data to view the mean-variance
trend
- The genes that are more differentially expressed have a lower
standard deviation - stats.

Bayesian Stats:

Part 3:
Visualizing DGEs
- Volcano plot = representation of impact a
treatment/condition that a pairwise comparison has on our dataset

- Our Volcano plot does not look like his - things to the right of 0
are upregulated in virus genes, things to the left are healthy
genes
- Genes with low p-values are very high up
- We can set arbitrary cutoffs and visualize them

- I adjusted the cutoffs to highlight the 7 genes that were
upregulated by the presence of poly I:C

- We can also pull out the DEGs - slightly odd to me that we only had
7 genes that were differentially expressed in the presence of
Poly:IC
- Same level of expression at 0 between 2 conditions
- More in Poly I:C condition to right of 0 - which is what we see
here
- We only care about things that are really highly up-regulated
(spread out further on X)
- Vertically on the graph shows p-value (significance)
- Cutoff p-value of 2 is too high –> changed to 1.3
- Not a lot of genes turning off - common for viral immune
response
> summary(results)
infection
Down 0
NotSig 11745
Up 7
- Had to review some code and ensure I changed disease to virus for
studydesign purposes
Venn Diagram of DEGs:

Data Table of DEGs:

- ISGs are interferon stimulated genes
- Look uo the genes and understand more about what they do, especially
DISP
- ISG15 = “interferon-stimulated gene”; known role in antiviral immune
response
- IRF7 = “interferon regulatory factor 7”; transcriptional activation
of virus-inducible cellular genes; master regulator of IFN-I immune
responses
- IFIT3 = “interferon-stimulated gene 49”; stimulates IFN production
during RNA virus infections
- IFIT1 = antiviral protein that recognizes 5′-triphosphate RNA
- IFI6 = IFN-stimulated gene (ISG) whose expression is highly
regulated by the stimulation of type I IFN-alpha that restricts various
kinds of virus infections by targeting different stages of the viral
life cycle
- IFI27 = counteracts innate immune responses after viral infections
by interfering with RIG-I signaling
- DISP2 = one of two human homologs of a segment-polarity gene known
as dispatched identified in Drosophila; product of this gene may be
required for normal Hedgehog (Hh) signaling during embryonic pattern
formation
- A Headgehog pathway is a signaling pathway that transmits
information to embryonic cells required for proper cell
differentiation
- Overall, the presence of poly I:C upregulated antiviral
immune-related genes
Module
Identification
Part 1: Intro to
Clustering and Starting the Step 6 Script
- Clustering = group genes or transcripts together based on
their behavior across the samples we measure the expression in; results
in modules of expressed genes
- We need to separate out genes that are behaving in very different
ways
- We are creating a character vector of colors (Rcolorbrewer)
- Having trouble creating diffgenes - need to re-check DEGList
- Changed
dim(diffgenes) to
dim(diffgenes.df)
- Replace “Data” section with (just use diffGenes from Step5
script):
results <- decideTests(ebFit, method="global", adjust.method="BH", p.value=0.01, lfc=2)
colnames(v.DEGList.filtered.norm$E) <- sampleLabels
dim(diffGenes)
Part 2: Making and
Interpreting Heatmaps
- Making a heatmap out of a data matrix
- Make a distance matrix based on correlations of our genes
- First, cluster genes (rows) in each set of differentially expressed
genes (need to find pairwise correlations of our genes)
- Secondly, cluster our samples (columns)
- Colorblindness matters when creating heatmaps, be aware
- scale = ‘row’; heatmpa according to row z-scores; puts emphasis on
comparing between samples
- scale = ‘none’; heatmap according to value; puts emphasis on
comparing between genes (will skew image so you view most highly
expressed genes
Cluster Heatmap:

Cluster Interative Heatmap:

- We now want to pull out modules of co-expressed genes
(genes that are upregulated or downregulated)
- Changed code below to 1:7 instead of 1:2307
module.assign.pivot <- pivot_longer(module.assign.df, # dataframe to be pivoted
cols = 1:7, # column names to be stored as a SINGLE variable
names_to = "geneID", # name of that new variable (column)
values_to = "module") # name of new variable (column) storing all the values (data)
Heatmap for Each Subcluster:

Part 3:
Alternative Clustering Methods
- Clustering:
- Unsupervised = blind to groups (hierarchical, k-means,
etc.); assign all genes to clusters (data partioning); genes can only
belong to one module (binary)
- Supervised = group aware (random forest, neural nets,
etc.)
- By adding a gene that does not fit into a module, we are creating
false positives
- Modules can differ dramatically based on clustering method used
- Clust generates “tight” modules; command-line tool
Functional
Enrichment Analysis
Part 1:
Introduction to Enrichment Analysis Methods
- Functional Enrichment Analysis = go from a collection of
genes that have patterns to them, to understanding what the genes do;
“pathway analysis”
- 2 main approaches:
- Gene Ontology (GO) = structured vocabulary for describing
gene function; requires arbitrary selection of genes; does not require
any data (gene identifiers only); species specific; user doesn’t control
vocabulary; manually curated; answers where a gene product is localized,
what functions it performs in that location, and how it carries out that
function; redundant and often requires summarization of results
- Gene Set Enrichment Analysis (GSEA) = no need to pre-select
genes; requires both identifiers and data; not species specific; user
can control vocabulary; like counting cards; 1) ranks rows/transcripts
in order of magnitude of change (up/downregulation), 2) rank order
genes; 2 types: self-contained (are any of the genes in my set
differentially expressed?) and competitive (are genes in set
more commonly differentially expressed than genes in set?)
- By clustering the genes in modules, we improve our ability to detect
functionaly enriched themes in these genes
- Think about why you set out to do an RNASeq analysis in the first
place (regulation of gene expression under our conditions)
- MSigDB = curated signature database
- Permutation = must be conducted for GSEA to determine if an
enrichment score is actually statistically significant; permuting
samples is preferred but large n required; permuting genes is common but
can inflate type I error due to inter-gene correlations
Part 2:
- When running gost, it carries out a functional enrichment
analysis comparing all of the genes in my set to all of the genes
annotated in humans to see whether our genes are enriched/expressed more
than they would be by chance
Manhattan Plot of Functional Enrichment Analysis Using
GO:

- The circles that are the farthest up are the most enriched
- TF:M11685_1 is most enriched (tallest blue dot); the rest are mostly
interferon signaling

- In upregulation module is:
> myModule
PIC01 PIC02 PIC03 LF07 LF08 LF09
IFI27 4.185237 3.912628 3.370907 2.009563 1.9274367 2.0385980
IFI6 3.814688 3.502734 2.845341 1.037494 0.9473482 0.7926239
IFIT1 4.584897 4.468331 3.980659 2.404900 2.6235918 2.2917992

- In downregulation module is:
> myModule
PIC01 PIC02 PIC03 LF07 LF08 LF09
DISP2 2.788966 2.914997 2.739440 0.6086220 1.2242679 1.254558268
IFIT3 3.664853 3.334991 3.145716 -0.7127142 0.8762171 -0.004597556
IRF7 2.869427 2.969015 2.898255 1.6208349 1.7143503 1.472258952
ISG15 1.677950 1.701252 1.534261 -0.4579730 -0.4314073 -0.388628741

Part 4:
Running Scripts with
Unstimulated and poly I:C data
- Created new studydesign_unst.txt, new readMapping_UNST.sh file
- Move into Bioinformatics_Research directory and run the following
lines in terminal
conda activate rnaseq
cd raw_data
./readMapping_UNST.sh
- Code taking too long to run, need to download screen
Running GO and GSEA
Functional Enrichment Analysis on External Sites

- Struggling with using GSEA app and getting input data in the correct
form
Making Presentation
Ready Figures
p3 <- ggplot(log2.cpm.filtered.norm.df.pivot) +
aes(x=samples, y=expression, fill=samples) +
geom_violin(trim = FALSE, show.legend = FALSE) +
stat_summary(fun = "median",
geom = "point",
shape = 95,
size = 10,
show.legend = FALSE) +
labs(y="Log2 Expression in Counts Per Million", x = "Samples",
title="Log2 Counts per Million (CPM) Across Samples",
subtitle="filtered, TMM normalized",
caption=paste0("produced on ", Sys.time())) +
theme_minimal()
p3 + scale_fill_manual(values=c("#E69F00", "#56B4E9", "#009E73",
"#F0E442", "#0072B2", "#D55E00"))
plot_grid(p1 + scale_fill_manual(values=c("#E69F00", "#56B4E9", "#009E73",
"#F0E442", "#0072B2", "#D55E00")),
p2 + scale_fill_manual(values=c("#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00")),
p3 + scale_fill_manual(values=c("#E69F00", "#56B4E9", "#009E73",
"#F0E442", "#0072B2", "#D55E00")), labels = c('A', 'B', 'C'), label_size = 12)




- Edited graphs from PC analysis
- Wanted to change labels but having trouble with ggplot command


ggplot(myTopHits.df) +
aes(y=-log10(adj.P.Val), x=logFC, text = paste("Symbol:", geneID)) +
geom_point(size=3.5,colour="#FF696D") +
geom_hline(yintercept = 1.3, linetype="longdash", colour="black", size=1) +
geom_vline(xintercept = 1, linetype="longdash", colour="black", size=1) +
geom_vline(xintercept = -1, linetype="longdash", colour="black", size=1) +
annotate("rect", xmin = 1, xmax = 4, ymin = 1.3, ymax = 1.8, alpha=.2, fill="grey") +
#annotate("rect", xmin = -1, xmax = -12, ymin = -log10(0.01), ymax = 7.5, alpha=.2, fill="#2C467A") +
labs(title="Volcano Plot of Gene Regulation",
caption=paste0("produced on ", Sys.time())) +
theme_minimal()

Creating Heatmap of
Differentially Expressed Genes
- Having trouble determining x and y for heatmap
> diffGenes.df
# A tibble: 7 × 7
geneID PIC01 PIC02 PIC03 LF07 LF08 LF09
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 DISP2 2.79 2.91 2.74 0.609 1.22 1.25
2 IFI27 4.19 3.91 3.37 2.01 1.93 2.04
3 IFI6 3.81 3.50 2.85 1.04 0.947 0.793
4 IFIT1 4.58 4.47 3.98 2.40 2.62 2.29
5 IFIT3 3.66 3.33 3.15 -0.713 0.876 -0.00460
6 IRF7 2.87 2.97 2.90 1.62 1.71 1.47
7 ISG15 1.68 1.70 1.53 -0.458 -0.431 -0.389
gg <- ggplot(diffGenes.df, aes(y=, x=PIC01, text = paste("Symbol:", geneID)))
gg <- gg + geom_tile(color='White', size=0.1) + scale_fill_viridis(name="Temperature") + coord_equal() + ggtitle("Differentially Expressed Genes Across Samples")
gg