0.1 R packages used

A variety of R packages were used for this analysis. All graphics and data wrangling were handled using the tidyverse suite of packages. All packages used are available from the Comprehensive R Archive Network (CRAN), Bioconductor.org, or GitHub.
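For reference, here is a minimal sketch of how packages from each of these three sources are typically installed; the package names are examples that appear later in these notes, and the GitHub repo is a placeholder:

install.packages("tidyverse")                         # CRAN
install.packages("BiocManager")                       # CRAN helper for Bioconductor
BiocManager::install(c("tximport", "edgeR", "limma")) # Bioconductor
install.packages("remotes")                           # CRAN helper for GitHub installs
# remotes::install_github("user/package")             # GitHub (placeholder repo)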


0.2 Course Overview

Goals:

  • Learn to analyze your own bulk and single cell RNA-seq data
  • Develop a lightweight and reusable RNA-seq pipeline
  • Learn best practices for working in R/Bioconductor
  • Become a better data scientist
  • Learn basic building blocks to create publication-quality graphics
  • Learn to report your work in a transparent and reproducible way

Course Path:

  1. Data import
  2. Data wrangling
  3. Data exploration
  4. Accessing public data
  5. Feature selection
  6. Finding modules
  7. Functional enrichment analysis
  8. Dynamic reports

0.3 Intro to RNAseq technology and data

0.3.1 Part 1: Illumina’s SBS

  • Shankar Balasubramanian and David Klenerman were scientists interested in using fluorescently tagged nucleotides to monitor DNA synthesis. This work led to Illumina sequencing (‘sequencing by synthesis’ technology).

  • There were many different sequencing techniques around 2012; however, many have since become obsolete.

  • The MiniSeq, MiSeq, NextSeq, and NovaSeq all perform RNA sequencing; however, each is capable of different run times and numbers of reads. This wide array of sequencing technology options has helped democratize sequencing across organizations and institutions.

  • There has been a shift in the time and money required for sequencing. Sequencing itself no longer takes much time, so you are able to design much larger and more complex experiments.

  • The major technological hurdle for Illumina sequencing was the inability, at the time, to sequence single molecules directly.

  • ‘Sequencing by Synthesis’ (SBS) is the process by which a single molecule is captured, folded over to hybridize to a primer, and used to synthesize a second, reverse strand.

  • PCR colony (‘polony’) size and distance can be controlled easily with SBS.

  • Clustering density is how close or far colonies are from each other.

    • Overclustering is when colonies are too close together and there are too many to resolve; it is hard to extract clean data from a crowded slide.

    • Underclustering is when there are too few colonies for collection. The data are clean, but very sparse.

  • How the sequencing happens (Illumina Sequencing Source): Once a polony is of an appropriate size, the data collection begins.

    1. The DNA is broken up into small fragments and adaptors (short DNA sequences) attach to the fragments.
    2. Through incubating the fragments with sodium hydroxide, the double-stranded DNA fragments become single stranded.
    3. The DNA fragments are then washed across the flow cell; complementary DNA molecules stick to the flow cell surface, and the rest is washed away.
    4. The flow cell DNA is replicated to form clusters of DNA with the same sequence. Each cluster emits a signal that can be picked up by a camera when sequenced.
    5. Extra nucleotide bases are added, which creates double-stranded bridges of DNA between the primers and the flow cell surface.
    6. Using heat, the DNA is made single-stranded resulting in multiple clusters of identical single-stranded DNA sequences.
    7. Fluorescently tagged nucleotides are added to the flowcell and they compete for addition to the growing chains. Primers are also added at this time.
    8. The primer binds to the DNA, and DNA polymerase adds one fluorescently tagged nucleotide to the growing strand.
    9. Lasers activate the fluorescence of the flow cell, and each nucleotide base emits a different color.
    10. The fluorescent nucleotide is removed and another is added. Repeat.

0.3.2 Part 2: Study Design

  1. You should always have an independent evaluation that your experiment worked. We do not want the RNAseq data itself to be proof of experimental success. We need to have validation that our experiment is doing what we think it should be doing.
  2. We need to have controls for our study.
  3. What insight is to be gained from each treatment or condition? We want to know what our “prize” analysis is, but also some secondary comparisons that maybe are not as interesting. These secondary and tertiary conditions help show that the experiment worked.
  4. What signal and noise are we expecting (signal-to-noise ratio)? There could be contamination in the cells (noise) that could impact our ability to detect true changes (signal).
  5. How many replicates do we need vs. how deep should we sequence? Comparing the two, more replicates are usually cheaper and more informative than deeper sequencing, and shallow sequencing can still reveal important biological patterns. With more than 12 replicates you gain good power to detect truly expressed genes, though ‘standard’ designs use only 2 to 3 replicates. Differences in results across read depths are extremely small, but deeper sequencing does help for alternative splicing.
  6. What is the cost? Library preparation and sequencing are the two main factors.
  • We are measuring gene expression in nucleated red blood cell progenitors in the presence of poly I:C (a double stranded RNA that mimics viral infection). Poly I:C (PIC) is not able to travel through the membrane on its own, so we use Lipofectamine (LF) (a lipid carrier that fuses with the plasma membrane) to carry it through to inside the cell to trigger an immune signaling pathway.
  • We have three replicates for when poly I:C and LF were present, and when just LF was present (vehicle control).
  • We conducted paired-end sequencing, meaning each fragment was read from both ends (R1 and R2, i.e., 5’ to 3’ and the complement 3’ to 5’) to increase our confidence in the reads.

0.4 Setting up your software environment

0.4.1 Part 1: Software tools

  • Compute resources:
    1. Lab workstation - unrestricted access, high costs
    2. The cloud - scalable, easy set-up, requires data transfer, less secure, potential charges
    3. Compute cluster - scalable, highly cost-efficient, secure
  • R/Bioconductor = free & open-source; no need to write code de novo; huge user community; reproducible results; modular & standardized; not only for transcriptomics; publication-quality graphics; no single ‘right way’ to do anything; lacking in terms of user interface

0.4.2 Part 2: Installing software

  • Download most current version of R
  • RStudio provides us with an Integrated Development Environment (IDE); it has a text editor, workspace, library/plots/file-browser/help panes, and a terminal built in
  • R packages have multiple uses with a focus on stats, graphics, and modeling
  • Always install any updates it offers
  • We will install Kallisto and other tools using Conda
  • Conda is an open-source environment management system that is critical for management of dependencies
  • We need to configure our Conda channels as well
  • You need to create your first Conda environment to install Kallisto
  • Conda environments are siloed
  • Downloaded Bioconductor, Conda, and all related dependencies

0.5 Ultra-fast Read Mapping with Kallisto

  • Read mapping is the process of aligning the reads with a reference genome

0.5.1 Part 1: FastQC & FastP

  • Try not to pollute the base environment - install packages into separate environments
  • Activate the environment with conda activate, then call the program you want to use
  • Do not unzip/decompress the files (decompressed fastq files can take more space than your local computer can hold, and the tools read the gzipped files directly)
  • Read quality drops as the number of cycles increases
  • Use fastqc to perform a quality assessment on our data (fastqc *.gz -t 8) or fastp (fastp -i sample1.gz -t 8)
  • HTML fastqc file can be opened in a web browser to view all of the fastqc data
  • Fastqc will give you plots and statistics; individual modules may be flagged as failing
  • Duplicate sequences are expected with RNAseq (highly expressed transcripts yield many identical reads), so a duplication warning is not a concern
  • You do not have to run fastp (I had some issues with it)
  • qc = quality control of our experimental mRNA data

0.5.2 Part 2: Kallisto & MultiQC

  • Downloaded DNA fasta file from [https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/dna/] (Homo_sapiens.GRCh38.cdna.all.fa.gz)
  • Drop fasta file in sublime to view the data
  • It contains one entry per transcript: the transcript ID, gene ID, gene symbol, and the cDNA sequence itself
  • We need to build an index (you only need to do it once)
  • Be careful with file names! (& do not use spaces)
  • Ran:
kallisto index -i Homo_sapiens.GRCh38.cdna.all.index Homo_sapiens.GRCh38.cdna.all.fa
  • Found that it has 1037056 contigs (contiguous sequences) and contains 116726086 k-mers
  • Creates a large index file as a result
  • Mapping single-end data with:
kallisto quant \
-i Homo_sapiens.GRCh38.cdna.all.index \
-o test \ 
-t 8 \
--single -l 250 -s 30 \
fastq file name \ 
&> test.log
  • Multi-QC demonstration: generates output and HTML files
  • Multiqc summarizes outputs
multiqc -d .
  • Mapping paired-end data with (note that the --single, -l, and -s flags are dropped, since fragment length is inferred from the read pairs):
kallisto quant \
-i Homo_sapiens.GRCh38.cdna.all.index \
-o test \
-t 8 \
read 1 fastq file name \
read 2 fastq file name \
&> test.log

0.5.3 Part 3: Methods for Quantifying Gene Expression

  • Kallisto performs pseudoalignment: it matches reads to a reference transcriptome rather than aligning them base by base to the genome
  • We want to automate many fastq files (index construction & mapping)
  • List all of the commands in a .sh file (shell script)
  • Rather than typing each command at the command line, list them all in the script and run it once
  • Download readMapping.sh file and drop it on Sublime; it gives you a pipeline
  • Read mapping is text pattern matching
  • There is a tradeoff when building an index: memory and computational complexity (e.g., STAR builds a large genome index)
  • The Burrows-Wheeler Transform (BWT) takes all rotations of a text, sorts the rows lexicographically, and keeps the last column
  • A suffix array lets you search lexicographically (align) within the sorted structure
  • Aligners take a seed sequence and extend matches (less computationally intensive than matching a single letter at a time)
  • Rsubread aligner = seeds ‘vote’ on where the alignment most likely is; makes the alignment process much less computationally intensive
  • Sailfish = exact matching of short k-mers; skips the approximate matching-and-extension step; shreds the reads into k-mers; involves few parameters
  • The idea of alignment was abandoned entirely for Kallisto:
  1. Create a transcriptome de Bruijn graph (T-DBG) from k-mers (k = 31 nucleotides)
  2. Walk each read's k-mers through the graph and ask which transcripts they are compatible with
  3. Splits in the graph show you which transcripts are compatible with the read
  4. Instead of aligning, we find the set of transcripts that are compatible with the read
  5. You don't even need to look up every possible k-mer
  6. You can jump between skip nodes and non-skip nodes
  7. For most reads, only two k-mers need to be hashed, because all k-mers from the entire read map to a contiguous region in the T-DBG
  8. It is really fast; kallisto can also generate bootstraps with -b, which can help you model technical variance in the dataset (we are not going to run bootstraps)
  • Used the following code to run readMapping.sh (on all of the fastq files):
chmod 777 readMapping.sh
./readMapping.sh
  • It again reported 1037056 contigs and 116726086 k-mers
  • Getting the following error for HTS0101, HTS0302, HTS0701, HTS0801, and HTS0901.
Failed to process file HTS048_09_3C_LF_S9_L001_R1_001.fastq
uk.ac.babraham.FastQC.Sequence.SequenceFormatException: Ran out of data in the middle of a fastq entry.  Your file is probably truncated
        at uk.ac.babraham.FastQC.Sequence.FastQFile.readNext(FastQFile.java:187)
        at uk.ac.babraham.FastQC.Sequence.FastQFile.next(FastQFile.java:129)
        at uk.ac.babraham.FastQC.Analysis.AnalysisRunner.run(AnalysisRunner.java:77)
        at java.base/java.lang.Thread.run(Thread.java:1623)
  • Originally ran with single-end read mapping, but should have used paired-end read mapping, which uses both R1 (5’ to 3’) and R2 (3’ to 5’), for completeness
    • With single-end read mapping, reads get sloppy at the beginning and end (paired-end avoids this)
    • This is why quality control failed at per base sequence content
  • To rerun with paired-end read mapping, I edited readMapping.sh, ran conda activate, navigated to raw Data folder, and ran ./readMapping.sh.

0.6 Understanding RNAseq Count Data

0.6.1 Part 1: Preamble

  • Downloaded step 1 script
  • We have used command line tools to assess the quality of the raw reads (quality control)
  • Used Kallisto to take a fasta file and turn it into an index (constructing an index only needs to be done once, and it makes repeated searching easier and faster)
  • Used Kallisto to map our raw reads against the transcriptome to get summarized read counts
  • I do not have a tsv file from Kallisto - going to rerun to see if I get one

0.6.2 Part 2: Measuring Digital Gene Expression

  • Effective length of a transcript = actual transcript length - fragment length + 1 (the number of possible positions a fragment of that size could originate from within the transcript)
  • Transcript is from the transcriptome & fragment is one of our reads
  • RNAseq gives a relative measure of gene expression
  • Techniques used to measure mRNA abundance and reads for RNAseq data are used to estimate a statistic that is as closely proportional to the relative molar concentration as possible
  • Understanding the units for measurement for RNAseq is critical to understanding how we determine differential expression
  • Normalization = any of a number of ways that a dataset is globally altered to improve our ability to detect differentially expressed genes
  • What happens to our data when we tweak it in different ways?
  • On the raw count scale, the most lowly expressed genes have low SDs and the most highly expressed genes have high SDs
  • This is a problem because unequal variance across the expression range (heteroskedasticity) makes it harder to pull real changes out of the data
  • A log2 conversion of the raw counts flips this: the most lowly expressed genes now have the highest SDs & vice versa
  • With large datasets, fixing one problem globally creates another (check your assumptions and fit them to your data)
  • It is still better to represent RNAseq data on a log2 scale
  • Scaling units for between-sample comparisons: ideally, the expected reads per sample would be even across samples, but the world is not this way (libraries are unbalanced); this scaling adjusts for the depth of sequencing of each sample
  • Scaling units for within-sample comparisons: different features (genes/transcripts) are different sizes; this scaling adjusts for the length of each feature
  • Reads per kilobase, per million reads (RPKM) = between-sample + within-sample scaling; a classic unit of expression for RNAseq data; RPKM = (# of reads mapped to a genomic region) / [(region length in kb) * (total # of reads)] * 10^6
  • RPKM is no longer the accepted unit; it is not sufficient for normalization
  • The problem with RPKM is that the average RPKM is not guaranteed to be the same value for every sample mapped to the same genome - averages can be very, very different between samples, so values are not directly comparable
  • Fixing RPKM is easy - scale by gene length first: transcripts per million (TPM) = (reads per kilobase, RPK) / (total RPK in sample) * 10^6; now the TPMs in every sample total 1 million (see the sketch at the end of this list)
  • Further normalization steps need to be taken before one can compare between samples
  • Library size scaling is too simple for many biological applications
  • Scaling will all be messed up because of the number of original genes matching to each one
  • Most current normalization methods attempt to find a set of genes with minimal variance across samples for normalization
  • Statistical tools make inherent assumptions about your data
  • Most genes are not differentially expressed in your dataset (implications for comparing very different treatments/conditions)
  • They also assume approximately equivalent numbers of up- and down-regulated genes
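To make these units concrete, here is a minimal sketch in R with a toy counts matrix and hypothetical gene lengths (all values invented for illustration):

# toy inputs: counts for 3 genes x 2 samples, and gene lengths (hypothetical)
counts <- matrix(c(100, 200, 300, 110, 190, 330), nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2")))
lengths_kb <- c(2, 1, 4) # gene lengths in kilobases

# RPKM: scale by library size (per million) first, then by gene length
rpkm <- t(t(counts) / colSums(counts)) * 1e6 / lengths_kb

# TPM: scale by gene length first (reads per kilobase), then by the per-sample RPK total
rpk <- counts / lengths_kb
tpm <- t(t(rpk) / colSums(rpk)) * 1e6

colSums(tpm)  # exactly 1e6 in every sample; colSums(rpkm) varies between samples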

0.6.3 Part 3: Starting Our R Project and Step 1 Script

  • When you start a new R project, RStudio starts a fresh R session, sources the .Rprofile in the project’s main directory, and gives you access to previous commands via the .Rhistory file
  • An R script != an RStudio project (one project can have many R scripts associated with it)
  • Edit -> Folding -> Collapse All breaks the script down into foldable code chunks (make a chunk by adding “----” after a comment title)
  • Edit -> Folding -> Expand All to undo it
  • RStudio is text-aware
  • Click install on right-hand pane under “Packages”
  • Run lines of code by highlighting them
setRepositories() # choose which repositories install.packages() can draw from
# entering "1 2 3 4 5 6 7 8" at the prompt selects all of them (CRAN plus the Bioconductor repositories and others)

0.7 Starting Your R Workflow

0.7.1 Part 1: Starting Step 1 Script

  • The tidyverse is an ecosystem of R packages
  • ?read_tsv gives you manual page (help documentation)
  • If you highlight an object’s name and press Control+Enter, it will print the object’s contents
  • setRepositories() before installing R packages
  • Automate file paths
  • Downloaded studydesign.txt from data section and updated it with appropriate folder names
  • My all(file.exists(path)) only produces one TRUE - this is expected: all() collapses the logical vector from file.exists() into a single value
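For reference, a minimal sketch of this part of the Step 1 script, assuming studydesign.txt has a column named sample and that each sample's Kallisto output lives in a folder of the same name (the column name and folder layout are assumptions):

library(tidyverse)
# read the tab-separated study design file
targets <- read_tsv("studydesign.txt")
# build one file path per sample, pointing at each Kallisto output folder
path <- file.path(targets$sample, "abundance.tsv")
# returns a single TRUE only if every file is found
all(file.exists(path))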

0.7.2 Part 2: Tapping Into Annotation Databases and Reading Kallisto Data into R

  • Want to map transcription level expression data to gene level expression data (need to have one-to-one mapping of transcript ids to gene ids) - this is because a transcript is not helpful unless we know which gene it is from
  • BiomaRt allows users to connect to a database of many different annotations
  • You can pull out all of the gene level annotations for a dog (ex.)
  • A tibble is a tidied-up data frame
  • With non-model organisms, there are often fewer gene symbols than transcript identifiers available
  • Setting txOut to TRUE in tximport returns transcript-level output instead of summarizing transcripts to the gene level
  • We used the tximport package to read our Kallisto outputs into the R environment
  • We used ensembldb to attach gene-level information to our transcript data (maps our data to the gene level)
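A minimal sketch of the import step, assuming path from the study design step and a two-column data frame Tx mapping transcript IDs to gene symbols; countsFromAbundance and ignoreTxVersion here are assumptions about sensible defaults, not a record of the exact call used:

library(tximport)
Txi_gene <- tximport(path,
                     type = "kallisto",               # our outputs came from Kallisto
                     tx2gene = Tx,                    # transcript-to-gene mapping
                     txOut = FALSE,                   # summarize transcripts to gene level
                     countsFromAbundance = "lengthScaledTPM",
                     ignoreTxVersion = TRUE)          # drop version suffixes on transcript IDs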

0.8 Wrangling Gene Expression Data

0.8.1 Part 1: Starting Step 2 Script

  • The Step 2 Script will be used to filter our data and get normalized read counts
  • R Data Types:
  • vector = series of objects
  • matrix = has rows & columns; will have the same kind of data
  • array = several matrices stacked together
  • dataframe = matrix but can have mix of types
  • list = any combination of the above data types
  • Our transcript list will be far bigger than our gene list
  • The # of elements is the number of data points in the matrix
  • There are a lot of transcripts with no evidence of gene expression data
  • “Counts” are the actual number of reads matching to that gene in that sample (it is an element in our list)
  • “Abundance” refers to the normalized abundance estimate (TPM for Kallisto), not a raw read count
  • The sums of the columns in myTPM should be close to 1 million, since TPM is per million - and remember that transcripts per million refers to transcripts, not genes
colSums(myTPM)
[1] 912920.2 910074.9 911529.2 909853.9 909855.2 909729.7
  • The sums of the columns in the gene-level data (myTPM_Counts) do not total 1 million - these are counts, not TPM
colSums(myTPM_Counts)
[1] 48423530 59355199 43597530 52757379 53130928 52197283
  • Use $ to get a column of a data frame

0.8.2 Part 2: Walking Through the Step 2 Script, & Relating Our Work to the ‘Grammar of Graphics’ and ‘Tidy’ Data

  • It is extremely easy to manipulate a data matrix (calculate standard deviation, average, and median)
  • ggplot() creates a graph using layers (you can stack these - styles) - give it the data you want it to graph and give it the aesthetics (the ‘grammar of graphics’)
myTPM.stats <- transform(myTPM, 
                         SD=rowSds(myTPM), 
                         AVG=rowMeans(myTPM),
                         MED=rowMedians(myTPM))
                         
ggplot(myTPM.stats) + 
  aes(x = SD, y = MED) +
  geom_point(shape=25, size=3)
  • There is heteroskedasticity across our data (unequal variance across median values) - representing our data on a log2 scale helps with this (an artifact of count data)
  • ggplot2 shape parameter = menu for type of shape for data points

Summary Stats (Median v. SD) For Each Transcript or Gene:

  • Use geom_hex instead of geom_point to represent the data as hexagonal bins (a density heatmap showing how many genes fall in each region)
ggplot(myTPM.stats) + 
  aes(x = SD, y = MED) +
  geom_hex() # geom_hex bins points; the shape/size arguments of geom_point do not apply
  • You can specialize themes of your ggplot with various theme packages
  • ggplot are plots on plots (layered)
  • Benefits of ggplots:
  • Users can iteratively update a plot, changing specific elements as needed
  • Even high-level aspects of a plot can be changed
  • Layers allow plots to be overlaid
  • Layers also make it easy to use # to toggle layers on/off
  • Default arguments for each layer minimize the need for users to dictate details
  • We make a DGEList (an edgeR data structure) from our counts - it has two elements: 1. the counts matrix with sample labels, 2. a samples data frame with group, library size, and normalization factors (good to know how large our libraries are)
  • We use counts per million (CPM) instead of TPM here because it reflects how many times we hit a particular gene, rather than rewarding lots of short fragments that don’t really align (noise)
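A minimal sketch of this step, assuming Txi_gene is the tximport output from Step 1:

library(edgeR)
myDGEList <- DGEList(counts = Txi_gene$counts)  # build the DGEList from the counts matrix
cpm <- cpm(myDGEList)                           # counts per million
log2.cpm <- cpm(myDGEList, log = TRUE)          # log2 CPM (a small prior count avoids log(0))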

Counts Per Million on Log2 Scale:

  • We will coerce our data matrix (which respects row names) to a data frame; since the tidy workflow does not rely on row names, we need to save the gene IDs as a geneID column
  • When we wrangle our data, we will often lose our sample labels, so we will need to add them back in
  • We will now work to filter our data to reduce any noise
  • Tidy data philosophy = “A dataset is messy or tidy (long) depending on how rows, columns and tables are matched up with observations, variables, and types.”
  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.
  • Gene expression data is always messy (wide) - we need to take every variable and make it form a column
  • Tidy data is ideal for graphing - values are easily retrieved as a single vector
  • We can plot the statistical summary of our data - plot every sample on the x-axis and the log2 on the y-axis
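A minimal sketch of the pivot and the summary plot, assuming a wide log2 CPM tibble named log2.cpm.df with a geneID column (the object name is an assumption):

# pivot from wide (one column per sample) to tidy/long form
log2.cpm.df.pivot <- pivot_longer(log2.cpm.df,
                                  cols = -geneID,           # pivot everything except geneID
                                  names_to = "samples",     # old column names become a 'samples' column
                                  values_to = "expression") # values go into an 'expression' column

# tidy form makes the summary plot easy: samples on x, log2 expression on y
ggplot(log2.cpm.df.pivot) +
  aes(x = samples, y = expression, fill = samples) +
  geom_violin(trim = FALSE, show.legend = FALSE)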

Filtered Data:

  • The plot shows the distribution of the data for each sample, with the data binned by sample
  • It would be unusual if there was a lot of variation among samples
  • You can flip the coordinates on the sample by adding coord_flip()
  • We are trying to make a logical matrix of all of the genes
  • 12229 / 35371 genes have no counts at all (great!) - you can summarize this with the table() command
  • We now set the cutoff for the number of samples in the smallest group of comparison & filter it
  • We are using subsetting to tell it which rows or columns we want to remove or keep
  • The following result means we kept 11752 genes - those with more than 1 count per million (CPM) in at least 3 samples (3 being the size of our smallest group of comparison)
keepers <- rowSums(cpm>1)>=3 # TRUE for genes with cpm > 1 in at least 3 samples
myDGEList.filtered <- myDGEList[keepers,] # keep only those rows
dim(myDGEList.filtered)
[1] 11752     6
  • We now normalize our data by applying the “TMM” (trimmed mean of M-values) method, so samples have consistent units
  • The normalization factors are no longer all 1; they adjust each library up or down to control for differences in count distributions
  • Our normalized data does not look very different from the non-normalized data, which suggests we have lots of data and reads (see the sketch below)
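A minimal sketch of the normalization call that produces the object used in the code below:

# compute TMM normalization factors (edgeR); stored in myDGEList.filtered.norm$samples
myDGEList.filtered.norm <- calcNormFactors(myDGEList.filtered, method = "TMM")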
log2.cpm.filtered.norm <- cpm(myDGEList.filtered.norm, log=TRUE)
log2.cpm.filtered.norm.df <- as_tibble(log2.cpm.filtered.norm, rownames = "geneID")
colnames(log2.cpm.filtered.norm.df) <- c("geneID", sampleLabels)

Normalized Data:

  • Today, we plotted the raw, filtered, and normalized data and evaluated their differences.

Combined Graphs:


0.9 Data Exploration

0.9.1 Part 1: Discussion of multivariate data, dimensional reduction via PCA, and starting our Step 3 script

  • Tidyverse packages and functions only work with data frames
  • The readr package always produces a data frame
  • tibble = data frame with nicer display and subsetting
  • piping = makes code more succinct and readable (%>%)
  • When it comes to sources of variance, not all experiments are created equal
  • As you increase the number of variables, you increase the potential sources of variance (the type of experiment you do will influence your sources of variance and the types of data you need to record)
  • examples of sources of variance: people conducting it, batch prep, variable confounded with variable of interest, sex (did we check for this?)
  • You need to record every variable in order to visualize the variance of our data
  • Dimensional reduction = reducing number of variables; PCA is a method of this
  • Principal Component Analysis (PCA) = find a linear combination of your variables that captures the largest spread (variance) in your cloud of data points (like a line of best fit); you then draw a line perpendicular to the first, which accounts for less of the variance; this reduces the number of dimensions in the dataset; PCA is an unsupervised method that has no information about sample labels
  • Reading in the studydesign.txt file, which tells us which samples are PIC and which are LF
  • We pull out the virus/healthy determinations and identify that these are the 2 classes with factor(group)
  • Hierarchical clustering only works on matrices; find the distance between points and cluster them (dist() produces a distance matrix showing the distance of every point to every other point); clustering produces a tree (dendrogram)
  • Right/left has no meaning in these kinds of trees; only vertical distance matters (it reflects the distance in the distance matrix)
  • It is really interesting that the LF and PIC samples are not clustering together respectively
  • Is LF08 really an outlier? - it may have the fewest reads aligning of all samples (there may be a technical rather than biological reason why it is different)
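A minimal sketch of the clustering and PCA calls, assuming the filtered, normalized log2 CPM matrix and sampleLabels from Step 2; the distance and agglomeration methods are illustrative choices, not necessarily the ones used here:

# samples need to be rows for dist(), hence the transpose
distance <- dist(t(log2.cpm.filtered.norm), method = "maximum")
clusters <- hclust(distance, method = "average")
plot(clusters, labels = sampleLabels) # draws the dendrogram shown below

# PCA on the same transposed matrix
pca.res <- prcomp(t(log2.cpm.filtered.norm), scale. = FALSE, retx = TRUE)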

Cluster Dendrogram of Raw Data

  • The results below show that the bulk of our variance is explained by PC1-PC4 (PC6 is essentially zero: after centering, six samples span at most five dimensions)
summary(pca.res) # Prints variance summary for all principal components.
Importance of components:
                          PC1     PC2     PC3     PC4     PC5       PC6
Standard deviation     12.834 12.4688 12.1273 11.0053 10.1853 6.316e-14
Proportion of Variance  0.238  0.2246  0.2125  0.1750  0.1499 0.000e+00
Cumulative Proportion   0.238  0.4626  0.6751  0.8501  1.0000 1.000e+00
  • rotation = how much each gene influenced each PC
  • x = how much each sample influenced each PC.

Variance by Each PC

  • The following results show the percent of variance explained by each PC; our variance is spread fairly evenly across the PCs
> pc.per
[1] 23.8 22.5 21.2 17.5 15.0  0.0

0.9.2 Part 2: Plotting PCA results and small multiples

  • The PCA scores have both magnitude and direction - they show how much each sample contributed to each PC

Original PCA Plot

  • You can use objects as axis labels
  • We now add colors to make our graph more informative

PCA Plot Colored

  • This graph does not really give meaning to PC1 (the samples do not separate by LF vs. PIC)
  • Additionally, the function warned there were too few data points to accurately cluster the samples

PCA Plot Labeled

  • ‘Small multiples’ plot = similar graphs using the same axes and labels; make it easy to spot interesting combinations of variables; shows magnitude and direction
  • pipe operator = %>% (a magrittr function, part of the tidyverse); streamlines our code by running multiple functions on successive results in one chain
  • A tibble allows us to view our data as a single numeric vector
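A minimal sketch of the small multiples plot, assuming pca.res, sampleLabels, and group from the earlier steps (plotting the first four PCs, per the variance summary above):

pca.res.df <- pca.res$x[, 1:4] %>%   # sample scores for the first four PCs
  as_tibble() %>%
  add_column(sample = sampleLabels, group = group)

pca.pivot <- pivot_longer(pca.res.df,
                          cols = PC1:PC4,
                          names_to = "PC",
                          values_to = "loadings")

ggplot(pca.pivot) +
  aes(x = sample, y = loadings, fill = group) +
  geom_bar(stat = "identity") +
  facet_wrap(~PC) +   # one small multiple per PC, same axes throughout
  coord_flip()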

PCA Small Multiples Plot

  • None of our small multiple plots of PCs tell us about healthy vs virus
  • PC2 and PC4 tell us about outliers
  • Take the time to thoroughly explore your data before downstream analysis
  • .fastq files have a lot of information in the headers

0.9.3 Part 3: Producing interactive tables and plots

  • dplyr = package used to mutate (add columns), filter (rows), select (columns), and arrange (sort rows) in data frames
  • gt package is great for making tables
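A minimal sketch of the four dplyr verbs feeding a gt table; mydata.df and the PIC.AVG/LF.AVG columns are hypothetical:

library(dplyr)
library(gt)

mydata.df %>%
  mutate(LogFC = PIC.AVG - LF.AVG) %>%  # add a log fold-change column (hypothetical columns)
  filter(abs(LogFC) > 1) %>%            # keep only rows passing a cutoff
  select(geneID, LogFC) %>%             # keep only these columns
  arrange(desc(LogFC)) %>%              # sort rows by LogFC
  gt()                                  # render as a formatted table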

Gene Expression in the Presence vs Absence of Poly I:C Formatted

  • use ggplot to make interactive (tooltip) plots

LogFC Table Formatted

  • These plots do not tell us anything about significance however
  • This graph shows that the same genes tend to be expressed in the presence and absence of poly I:C
  • It is possible that I did not choose genes involved in the immune response but are rather expressed all of the time

0.10 Accessing Public Data

0.10.1 Part 1: Accessing Public RNA-seq Data

  • Sequence Read Archive (SRA) = NCBI public sequencing database
  • You can directly download the .fastq files from there
  • There are SRA tools for the command line
  • Gene Expression Omnibus (GEO) = NCBI’s public repository for processed gene expression data; data generation there has outpaced analysis
  • Community-driven projects have helped to standardize sample collection, storage and data generation
  • Tools for rapid and large-scale reuse of public RNA-seq data have lagged behind
  • Hierarchical Data Format v5 (HDF5) = self-describing file format; uses groups and datasets instead of folders and files; perfect for storing genetic expression data; have different types of data in same file
  • ARCHS4 = “All RNA-seq and ChIP-seq Sample and Signature Search”; provides uniformly processed public RNA-seq samples at massive scale

0.10.2 Part 2: Step 4 Script to access the ARCHS4 Database

  • Had trouble downloading the human and mouse sample files
  • hdf5 file = has groups and datasets; contains metadata;
  • Samples from the same study are grouped together in the ARCHS4 Database
  • You can create your own studydesign file with ARCHS4 (with sample title, sample source, and sample characteristics)
  • Use it to determine the relationship between your PCs and the metadata
  • Will go back in later with results once sample files finish downloading

0.11 Checking Viability of Results

  • Since the plots produced above show little differentiation between the PIC and LF samples, we re-ran scripts 1, 2, and 3 to check that this is correct
  • We reran them because, based on our hypothesis, we would expect our control and experimental conditions to produce differing gene expression; however, this is not what we see
  • After making PCA plots for filtered & unnormalized data, unfiltered & normalized data, and unfiltered & unnormalized data (and with Rebecca’s checking), we found that there are indeed many similarities between the two conditions
  • As a result, I will carry out the following analyses with this in mind
  • I reran the scripts to ensure the data were filtered and normalized
  • I filtered for these specific immune genes (the condition below goes inside dplyr::filter()) & there were no results. This may mean there was no expression of these genes (or that they were removed during filtering).
geneID=="MMP1" | geneID=="GZMB" | geneID=="IL1B" | geneID=="GNLY" | geneID=="IFNG" | geneID=="CCL4" | geneID=="PRF1" | geneID=="APOBEC3A" | geneID=="UNC13A" 

0.12 Differential Gene Expression

0.12.1 Part 1: Introduction to Differential Feature Selection in R/Bioconductor

  • We have been preparing for DGE analysis all along
  • Kallisto quantifies so quickly that repeated bootstrapping is practical - bootstraps give insight into the technical uncertainty of RNAseq count estimates
  • DGE = differential gene expression
  • DTE = differential transcript expression
  • DTU = differential transcript usage (what percentage of output is due to specific isoforms - isoform switching)
  • Genes can be more accurately measured than transcripts (compare actual TPM to estimated TPM)
  • If you aggregate transcript-level data to gene-level data, you get more accurate counts of gene expression

0.12.2 Part 2: Starting Step 5 Script for DGE Analysis

  • design matrix = makes your experimental design explicit, since the choice of comparisons is not inherent in the studydesign file
  • model matrix = helps you set up how you are going to apply a linear model; assigns 1 or 0 to each sample according to whether it is PIC or LF
  • RNAseq data has an unequal variance/range across it (heteroskedasticity)
  • Voom = limma function that models the mean-variance relationship & will stabilize your data
  • We are fitting a linear model to our data to view the mean-variance trend (see the sketch below)
  • In the mean-variance trend, the more highly expressed genes have a lower standard deviation
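A minimal sketch of the limma workflow, assuming group (PIC vs. LF) from Step 1 and the filtered, normalized DGEList from Step 2; the contrast name "infection" matches the results summary shown later:

library(limma)
design <- model.matrix(~0 + group)       # one column per group, no intercept
colnames(design) <- levels(group)        # "PIC" and "LF"

v.DEGList.filtered.norm <- voom(myDGEList.filtered.norm, design, plot = TRUE) # mean-variance trend
fit <- lmFit(v.DEGList.filtered.norm, design)                                 # fit linear model per gene
contrast.matrix <- makeContrasts(infection = PIC - LF, levels = design)       # the comparison we care about
fits <- contrasts.fit(fit, contrast.matrix)
ebFit <- eBayes(fits)                    # empirical Bayes moderation of the gene-wise variances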

Bayesian Stats:

0.12.3 Part 3: Visualizing DGEs

  • Volcano plot = representation of the impact a treatment/condition (a pairwise comparison) has on our dataset

  • Our volcano plot does not look like the instructor’s - genes to the right of 0 are upregulated in the virus (PIC) condition, genes to the left are higher in the healthy (LF) condition
  • Genes with low p-values are very high up
  • We can set arbitrary cutoffs and visualize them

  • I adjusted the cutoffs to highlight the 7 genes that were upregulated by the presence of poly I:C

  • We can also pull out the DEGs - slightly odd to me that we only had 7 genes that were differentially expressed in the presence of poly I:C
  • Genes at 0 have the same level of expression between the 2 conditions
  • Genes upregulated in the poly I:C condition fall to the right of 0 - which is what we see here
  • We mostly care about genes that are strongly up-regulated (spread out further on the x-axis)
  • The vertical axis shows significance (-log10 of the adjusted p-value)
  • A y-axis cutoff of 2 (adjusted p = 0.01) was too stringent -> changed to 1.3 (adjusted p of about 0.05)
  • Not a lot of genes turning off - common for a viral immune response
> summary(results)
       infection
Down           0
NotSig     11745
Up             7
  • Had to review some code and ensure I changed disease to virus for studydesign purposes

Venn Diagram of DEGs:

Data Table of DEGs:

  • ISGs are interferon stimulated genes
  • Look up the genes and understand more about what they do, especially DISP2
  • ISG15 = “interferon-stimulated gene”; known role in antiviral immune response
  • IRF7 = “interferon regulatory factor 7”; transcriptional activation of virus-inducible cellular genes; master regulator of IFN-I immune responses
  • IFIT3 = “interferon-stimulated gene 49”; stimulates IFN production during RNA virus infections
  • IFIT1 = antiviral protein that recognizes 5′-triphosphate RNA
  • IFI6 = IFN-stimulated gene (ISG) whose expression is highly regulated by the stimulation of type I IFN-alpha that restricts various kinds of virus infections by targeting different stages of the viral life cycle
  • IFI27 = counteracts innate immune responses after viral infections by interfering with RIG-I signaling
  • DISP2 = one of two human homologs of a segment-polarity gene known as dispatched identified in Drosophila; product of this gene may be required for normal Hedgehog (Hh) signaling during embryonic pattern formation
    • The Hedgehog pathway is a signaling pathway that transmits information to embryonic cells required for proper cell differentiation
  • Overall, the presence of poly I:C upregulated antiviral immune-related genes

0.13 Module Identification

0.13.1 Part 1: Intro to Clustering and Starting the Step 6 Script

  • Clustering = group genes or transcripts together based on their behavior across the samples we measure the expression in; results in modules of expressed genes
  • We need to separate out genes that are behaving in very different ways
  • We are creating a character vector of colors (RColorBrewer)
  • Having trouble creating diffgenes - need to re-check DEGList
  • Changed dim(diffgenes) to dim(diffgenes.df)
  • Replace “Data” section with (just use diffGenes from Step5 script):
results <- decideTests(ebFit, method="global", adjust.method="BH", p.value=0.01, lfc=2)
colnames(v.DEGList.filtered.norm$E) <- sampleLabels
dim(diffGenes)

0.13.2 Part 2: Making and Interpreting Heatmaps

  • Making a heatmap out of a data matrix
  • Make a distance matrix based on correlations of our genes
  • First, cluster genes (rows) in each set of differentially expressed genes (need to find pairwise correlations of our genes)
  • Secondly, cluster our samples (columns)
  • Colorblindness matters when creating heatmaps, be aware
  • scale = ‘row’ draws the heatmap according to row z-scores; puts emphasis on comparing between samples
  • scale = ‘none’ draws the heatmap according to raw values; puts emphasis on comparing between genes (will skew the image so the most highly expressed genes dominate)
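A minimal sketch of the heatmap call, assuming the diffGenes matrix from Step 5 and a color vector myheatcolors (e.g., from RColorBrewer); the correlation and linkage methods are illustrative choices:

library(gplots)
# cluster genes (rows) and samples (columns) on correlation-based distances
clustRows <- hclust(as.dist(1 - cor(t(diffGenes), method = "pearson")), method = "complete")
clustColumns <- hclust(as.dist(1 - cor(diffGenes, method = "spearman")), method = "complete")

heatmap.2(diffGenes,
          Rowv = as.dendrogram(clustRows),
          Colv = as.dendrogram(clustColumns),
          col = myheatcolors,
          scale = "row",   # z-score each gene; emphasizes comparison between samples
          trace = "none", density.info = "none")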

Cluster Heatmap:

Cluster Interactive Heatmap:

  • We now want to pull out modules of co-expressed genes (genes that are upregulated or downregulated)
  • Changed code below to 1:7 instead of 1:2307
module.assign.pivot <- pivot_longer(module.assign.df, # dataframe to be pivoted
                          cols = 1:7, # column names to be stored as a SINGLE variable
                          names_to = "geneID", # name of that new variable (column)
                          values_to = "module") # name of new variable (column) storing all the values (data)

Heatmap for Each Subcluster:

0.13.3 Part 3: Alternative Clustering Methods

  • Clustering:
    • Unsupervised = blind to groups (hierarchical, k-means, etc.); assigns all genes to clusters (data partitioning); each gene can belong to only one module (binary)
    • Supervised = group aware (random forest, neural nets, etc.)
  • By adding a gene that does not fit into a module, we are creating false positives
  • Modules can differ dramatically based on clustering method used
  • Clust generates “tight” modules; command-line tool

0.14 Functional Enrichment Analysis

0.14.1 Part 1: Introduction to Enrichment Analysis Methods

  • Functional Enrichment Analysis = go from a collection of genes that have patterns to them, to understanding what the genes do; “pathway analysis”
  • 2 main approaches:
    • Gene Ontology (GO) = structured vocabulary for describing gene function; requires arbitrary selection of genes; does not require any data (gene identifiers only); species specific; user doesn’t control vocabulary; manually curated; answers where a gene product is localized, what functions it performs in that location, and how it carries out that function; redundant and often requires summarization of results
    • Gene Set Enrichment Analysis (GSEA) = no need to pre-select genes; requires both identifiers and data; not species specific; user can control the vocabulary; like counting cards: rank-order the genes by magnitude of change (up/downregulation), then walk down the ranked list scoring each gene set; 2 types: self-contained (are any of the genes in my set differentially expressed?) and competitive (are genes in my set more commonly differentially expressed than genes not in the set?)
  • By clustering the genes into modules, we improve our ability to detect functionally enriched themes in these genes
  • Think about why you set out to do an RNASeq analysis in the first place (regulation of gene expression under our conditions)
  • MSigDB = curated signature database
  • Permutation = must be conducted for GSEA to determine if an enrichment score is actually statistically significant; permuting samples is preferred but large n required; permuting genes is common but can inflate type I error due to inter-gene correlations

0.14.2 Part 2:

  • Running gost carries out a functional enrichment analysis, comparing all of the genes in my set to all of the genes annotated in humans to see whether our genes are enriched more than would be expected by chance (see the sketch below)
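A minimal sketch using gprofiler2, assuming myTopHits is a character vector of gene symbols from the DGE analysis:

library(gprofiler2)
# run the enrichment query against human annotations
gost.res <- gost(myTopHits, organism = "hsapiens", correction_method = "fdr")
# Manhattan plot of the enriched terms
gostplot(gost.res, interactive = TRUE, capped = TRUE)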

Manhattan Plot of Functional Enrichment Analysis Using GO:

  • The circles that are the farthest up are the most enriched
  • TF:M11685_1 is most enriched (tallest blue dot); the rest are mostly interferon signaling

  • The upregulated module contains:
> myModule
         PIC01    PIC02    PIC03     LF07      LF08      LF09
IFI27 4.185237 3.912628 3.370907 2.009563 1.9274367 2.0385980
IFI6  3.814688 3.502734 2.845341 1.037494 0.9473482 0.7926239
IFIT1 4.584897 4.468331 3.980659 2.404900 2.6235918 2.2917992

  • The downregulated module contains:
> myModule
         PIC01    PIC02    PIC03       LF07       LF08         LF09
DISP2 2.788966 2.914997 2.739440  0.6086220  1.2242679  1.254558268
IFIT3 3.664853 3.334991 3.145716 -0.7127142  0.8762171 -0.004597556
IRF7  2.869427 2.969015 2.898255  1.6208349  1.7143503  1.472258952
ISG15 1.677950 1.701252 1.534261 -0.4579730 -0.4314073 -0.388628741

0.14.3 Part 3:

0.14.4 Part 4:


0.15 Running Scripts with Unstimulated and poly I:C data

  • Created new studydesign_unst.txt, new readMapping_UNST.sh file
  • Move into Bioinformatics_Research directory and run the following lines in terminal
conda activate rnaseq
cd raw_data
./readMapping_UNST.sh
  • The code is taking too long to run; need to run it inside a screen session (GNU screen) so it can continue in the background

0.16 Running GO and GSEA Functional Enrichment Analysis on External Sites

  • Struggling with the GSEA app and getting the input data into the correct format

0.17 Making Presentation Ready Figures

  • Edited violin plots
p3 <- ggplot(log2.cpm.filtered.norm.df.pivot) +
  aes(x=samples, y=expression, fill=samples) +
  geom_violin(trim = FALSE, show.legend = FALSE) +
  stat_summary(fun = "median", 
               geom = "point", 
               shape = 95, 
               size = 10, 
               show.legend = FALSE) +
  labs(y="Log2 Expression in Counts Per Million", x = "Samples",
       title="Log2 Counts per Million (CPM) Across Samples",
       subtitle="filtered, TMM normalized",
       caption=paste0("produced on ", Sys.time())) +
  theme_minimal()

myColors <- c("#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00")

p3 + scale_fill_manual(values = myColors)

plot_grid(p1 + scale_fill_manual(values = myColors),
          p2 + scale_fill_manual(values = myColors),
          p3 + scale_fill_manual(values = myColors),
          labels = c('A', 'B', 'C'), label_size = 12)

  • Edited graphs from PC analysis
  • Wanted to change labels but having trouble with ggplot command

  • Adjust Volcano Plot
ggplot(myTopHits.df) +
  aes(y=-log10(adj.P.Val), x=logFC, text = paste("Symbol:", geneID)) +
  geom_point(size=3.5,colour="#FF696D") +
  geom_hline(yintercept = 1.3, linetype="longdash", colour="black", size=1) +
  geom_vline(xintercept = 1, linetype="longdash", colour="black", size=1) +
  geom_vline(xintercept = -1, linetype="longdash", colour="black", size=1) +
  annotate("rect", xmin = 1, xmax = 4, ymin = 1.3, ymax = 1.8, alpha=.2, fill="grey") +
  #annotate("rect", xmin = -1, xmax = -12, ymin = -log10(0.01), ymax = 7.5, alpha=.2, fill="#2C467A") +
  labs(title="Volcano Plot of Gene Regulation",
       caption=paste0("produced on ", Sys.time())) +
  theme_minimal()


0.18 Creating Heatmap of Differentially Expressed Genes

  • Having trouble determining x and y for the heatmap - fixed below by pivoting the tibble to long form, so geom_tile gets x (sample), y (geneID), and fill (expression); also swapped scale_fill_viridis for ggplot2’s built-in scale_fill_viridis_c
> diffGenes.df
# A tibble: 7 × 7
  geneID PIC01 PIC02 PIC03   LF07   LF08     LF09
  <chr>  <dbl> <dbl> <dbl>  <dbl>  <dbl>    <dbl>
1 DISP2   2.79  2.91  2.74  0.609  1.22   1.25   
2 IFI27   4.19  3.91  3.37  2.01   1.93   2.04   
3 IFI6    3.81  3.50  2.85  1.04   0.947  0.793  
4 IFIT1   4.58  4.47  3.98  2.40   2.62   2.29   
5 IFIT3   3.66  3.33  3.15 -0.713  0.876 -0.00460
6 IRF7    2.87  2.97  2.90  1.62   1.71   1.47   
7 ISG15   1.68  1.70  1.53 -0.458 -0.431 -0.389

diffGenes.pivot <- pivot_longer(diffGenes.df, cols = -geneID, names_to = "sample", values_to = "expression") # geom_tile needs long data
gg <- ggplot(diffGenes.pivot, aes(x = sample, y = geneID, fill = expression, text = paste("Symbol:", geneID)))
gg <- gg + geom_tile(color = 'White', size = 0.1) + scale_fill_viridis_c(name = "Log2 expression") + coord_equal() + ggtitle("Differentially Expressed Genes Across Samples")
gg