#Rsubread builds reference genome index, maps sequence reads to annotated index
library(Rsubread)
#For DGE analysis and visualization
library(edgeR)
#ggplot2
library(tidyverse)
#interactive MD plot w/ gene-wise expression 
library(Glimma)
#top gene table
library(reactable)
#annotation of top gene table
library(annotate)
library(org.Hs.eg.db)

Introduction:

The focus of this paper is the statistical analysis of a high throughput RNA-seq dataset of Fulvestrant-treated MCF-7 breast cancer cells and an untreated MCF-7 group as a control. (Madak-Erdogan 2020) (Lee 2015) This analysis is conducted by fitting a Quasi-Negative Binomial model with quasi-likelihood dispersion parameter estimation using empirical Bayes methods. (Lun 2015) Fulvestrant is a first-in-class selective estrogen-receptor (ER) degrader, it functions by binding the ER, which brings forth degradation of the ER. This kinetic disruption blocks ER ligand-interactions, in effect, eliminating estrogen as a transcription factor. This process deprives tumors from the growth-effects of estrogen. (Fulvestrant: Medline) Disrupting the homeostatic ligand-interaction of MCF-7 cells should be apparent in their gene expression profiles. High-throughput sequencing of extracted mRNA from this experiment will reveal the gene expression profiles of cells; therefore, this technology allows us to interrogate the changes that occur under Fulvestrant-treatment. Comparing Fulvestrant-treated and untreated MCF-7 cells brings forth the question, what is the relationship of Fulvestrant treatment and the gene expression profiles of MCF-7? This question is addressed with the null hypothesis, that there is no significant difference in the gene expression profiles of Fulvestrant-treated and untreated MCF-7 cells. This statistical analysis is conducted using the R package “edgeR”, which employs a quasi-negative binomial model (quasi-NB) to accommodate the experimental design and increase testing power. The adjustments to the NB model are made due to properties of the mean-variance relationship of gene abundance with low sample size and the technical variation of uneven sample sizes and sequencing artefacts. These modifications include normalization factor for uneven sample sizes/mRNA composition bias and quasi-likelihood (QL) estimation with empirical Bayes (EB) methods to improve dispersion estimation. The shrunken data points fit mean estimates that are then used for identification of differential gene expression (DGE) via QL f-tests. (Lun 2015)

Methods:

This analysis was made possible by the in vitro experimentation of Zeynep Madak-Erdogan (University of Illinois at Urbana) and high-throughput sequencing of Illumina. The data generated by these sources was made publicly available on the National Center for Biotechnology Information (NCBI) as part of their Gene Expression Omnibus (GEO) system. The file hosting and sharing is part of the Sequence Read Archive, which requires a cloud service such as Amazon Web Service (AWS) for file-transfer-protocol access. (SRA in the Cloud) The data was obtained as raw sequence reads in FASTQ format.

Data pre-processing: Six FASTQ files, one file per sample, containing the sequencing information was generated by the Illumina Novaseq 6000 platform. These files are raw sequence data and require preparation in order to determine the mRNA diversity and abundance of the samples. The R package “RSubread” is utilized for this RNA-seq data pre-processing. The in-built functions of map sequences to an exon-annotated reference genome, and then performs quantification that determines sequence abundance. (Laio 2019)

Rsubread intakes the raw FASTQ file and transforms it into a data frame of mapped reads in BAM file format. Before the gene-mapping can occur, a hash-table index of the human genome needs to be built. This process occurs by inputting the European Bioinformatics Institute reference human genome (GRCh38) into the buildindex() function. (The New Improved Human Genome 2020)

This genome reference hash table is utilized in gene mapping within the align() function. The FASTQ files are iteratively scanned in 16-basepair pieces and assigned to a slot on the hash table. These 16-bp reads overlap and the algorithm decides which gene to ascribe the read to by counting which hash-table index received the most assignments from a particular region. This process is called “seed-and-vote” by the authors of RSubread. (Laio 2019)

####The purpose of the workflow is to find differentially expressed genes ##from a RNA-seq data of Estrogen-receptor+ breast cancer cell lines #(MCF-7), a group treated with 
#fulvestrant (estrogen receptor antagonist) and a vehicle group. 
#Data obtained from:
#https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE136823 
#courtesy of:Zeynep Madak-Erdogan @ University of Illinois at Urbana
#data was downloaded from Sequence Read Archive via 
#Amazon Web Services S3 could system

#Build list of the raw sequence reads (data obtained from SRA) 
#source: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE136823
#credit: Zeynep Madak-Erdogan
fastq.files <- list.files(path = "rnaseq/", pattern = ".fastq$", full.names = TRUE)

##abbreviate sample names
fastq.abbrv <- substr(fastq.files, start = 8, stop = 9)

############################################################# 
####### __        ___    ____  _   _ ___ _   _  ____  #######
####### \ \      / / \  |  _ \| \ | |_ _| \ | |/ ___| #######
#######  \ \ /\ / / _ \ | |_) |  \| || ||  \| | |  _  #######
#######   \ V  V / ___ \|  _ <| |\  || || |\  | |_| | #######
#######    \_/\_/_/   \_\_| \_\_| \_|___|_| \_|\____| #######
#Index Building & Gene Mapping is Computationally Intensive!!
###TAKES HOURS @ 8GB/RAM ###Enter at your own RISK!!ლ(ಠ益ಠლ)                                         
#Build the index - this takes the entire human genome, decompresses it 
#and builds a hash table of key-value pairs of 16-bp sequences (keys) 
#are assigned to chromosomal locations (values)
#this will be used to map the rna-seq reads to reference genome from: 
#GENCODE database via the following links:
#ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/GRCh38.primary_assembly.genome.fa.gz
#
#index is in directory
buildindex(basename="Ebi_Homo_Sapien_Index",reference="GRCh38.primary_assembly.genome.fa.gz", gappedIndex = TRUE, indexSplit = TRUE)

##################Rsubread Alignment################
#####   align() creates an index hash table 
#that mediates the mapping of sequence reads to the reference index
align.stat <- align(index = "Ebi_Homo_Sapien_Index", readfile1 = fastq.files)

#Output is .BAM file, this is used for 
#normalizing and counting the sequence reads

The aligned reads are summarized below:

#build a vector of themapped read file names
bam.files <- list.files(path = "rnaseq/", pattern = ".BAM$", full.names = TRUE)

The function featureCounts() performs quantification of the mapped reads. The argument parameters of the fulvestrant experiment were set to quantify the non-coding strand sequence of exons for the human genome.

#Prepare .BAM files for quantification
#build a vector of the mapped read file names
bam.files <- list.files(path = "rnaseq/", pattern = ".BAM$", full.names = TRUE)


#featureCounts generates a matrix of read counts of mapped genes in each sample
#strandSpecific = 2 because 'TruSeq Stranded mRNAseq Sample Prep kit' (Illumina). 
#Read 1 aligns to the ANTISENSE strand and Read 2 aligns to the SENSE strand,
#this maps the mrna sequence to the annotated exons within inbuilt annotation

fcRS <- featureCounts(bam.files, annot.inbuilt="hg38", strandSpecific = 2)

Within the outputted featurecount object contains the counts, which is 28,395 rows long, each row containing a gene and 6 columns wide, with a sample in each column. This fc object was written to files in multiple formats for downstream statistical analysis in the packages “edgeR” and “Glimma”.

###This chunk is a files that were written for downstream processing or analysis

####write counts to table 
fcRS_counts <- fcRS$counts


#output a txt file of count data
write.table(x=data.frame(fcRS$annotation[,c("GeneID","Length")],fcRS$counts,stringsAsFactors=FALSE),file="RScountsEBI.txt",quote=FALSE,sep="\t",row.names=FALSE)

#For edgeR quasi-likelihood DGE
write.table(x=data.frame(fcRS$annotation[,c("GeneID")],fcRS$counts,stringsAsFactors=FALSE),file="counts.txt",quote=FALSE,sep="\t",row.names=FALSE)

#load data into a table (for count distribution)
EBI_RStrandedCountTable <- read.table(file = 'RScountsEBI.txt', header = TRUE)

#load data into a table for quasi-likelihood DGE
CountTable <- read.table(file = 'counts.txt', header = TRUE)

Results:

The initial analysis was a multi-dimensional scaling plot constructed by the function plotMDS() within edgeR. This plot is a principal component analysis, logfold values of count data is generated, then used to construct multi-dimensional differences between the samples. The dimensionality is then reduced to 2D to show similarity between the samples.

#Prepare for PCA Plot
#Create a DGEList object from featureCount object for downstream analysis 
DGEfulvestrant <- DGEList(counts=fcRS$counts, genes=fcRS$annotation[,c("GeneID","Length")])

####Filter to keep genes which had  more than 10 reads per million mapped
#reads in at least two libraries. 
FilteredGenes <- rowSums(cpm(DGEfulvestrant) > 10) >= 2
#write to dataframe
DGEfulvestrant <- DGEfulvestrant[FilteredGenes,]

##Normalization. Perform voom normalization: Adjusts for between sample
#sequence library size so statisical analysis can be performed
Normalization <- voom(DGEfulvestrant,designmatrix,plot=T)

####MULTI-DIMENSIONAL SCALING PLOT
####
#distances on the plot approximate the typical 
#log2 fold changes between the samples.
#This function is a variation on the usual multdimensional 
#scaling (or principle coordinate) plot, in that a distance 
#measure particularly appropriate for the microarray context 
#is used. The distance between each pair of samples 
#(columns) is the root-mean-square deviation (Euclidean distance) 
#for the top genes. Distances on the plot can be 
#interpreted as leading log2-fold-change, meaning the 
#typical (root-mean-square) log2-fold-change between the samples for the genes that distinguish those samples.

#tidy sample names(needed for Multi-Dimensional (PCA) plot labels later on)
fcRS$targets <- substr(fcRS$targets, start = 1, stop = 1)

plotMDS(Normalization, labels = samples, xlim=c(-1,1),ylim=c(-1.0,.5))

The untreated, vehicle group (V1, V2, V3) were clustered nearly identically in the first and most informing cluster, dimension 1, while the Fulvestrant-treated group had a tight grouping between F1 and F2 and separation on the second dimension for F3. This plot shows that the cells transcription reacted to the drugs and DGE analysis is worth pursuing. Inspection of the count and residual distribution was conducted to determine which statistical model is appropriate. Binning genes by raw expression counts on a histogram illustrates the exponential distribution of this data set. This is expected of mRNA-seq data, as a typical cell has many low-expression genes and relatively few high expression genes. (Mistry 2020)

#######HISTOGRAM OF Fulvestrant Sample 1 gene expression v. # of genes 
ggplot(EBI_RStrandedCountTable) +
  geom_histogram(aes(x = F1_CAAGCTAG.ACATAGCG_L002_R1_001..1..fastq.subread.BAM), stat = "bin", bins = 250, colour = "lightblue", fill = "orange") +
  xlab("Raw expression counts") +
  ylab("Number of genes")+
  xlim(0,20000)+
  ylim(0,1760)+
  theme_dark()

The mean-variance relationship is examined and displays an increase in variance as mean expression rises, a trend known as overdispersion.

#scatter-plot to display mean-variance relationship
####write counts to table (needed for mean/variance dependence plot and 
#later use in contrast matrix)
fcRS_counts <- fcRS$counts

#generate a vector of gene-wise mean expression
mean <- apply(fcRS_counts[,1:6], 1, mean)        
#The second argument '1' of 'apply' function indicates the function being applied to rows. Use '2' if applied to columns 
#Generate a vector of gene-wise variance
variance <- apply(fcRS_counts[,1:6], 1, var)
#assemble mean and variance into a dataframe
MeanVariancedf <- data.frame(mean, variance)


#scatter-plot to display mean-variance relationship
ggplot(MeanVariancedf) +
  geom_point(aes(x=mean, y=variance), colour = "orange", alpha = .8) + 
  scale_y_log10(limits = c(1,1e9)) +
  scale_x_log10(limits = c(1,1e6)) +
  geom_abline(intercept = 0, slope = 1, color="green")+
  xlab("Mean Gene Expression")+
  ylab("Variance of Gene Expression")+
  theme_dark()

Fitting a model: Generalized linear models for probabilistic and discrete mRNA counts that are highly influenced by systematic and stochastic sources of biological variation; as well as technical sources of variation are most reliable in estimating gene-wise mean expression. A poisson distribution is often unreliable when n is small, however, the central limit theorem allows for its use as n→∞. (Mistry 2020 ) However, a large n is often unrealistic to most experimental constraints. Instead, its advantageous to fit a gene-wise generalized negative binomial model:

Further, this is a two-parameter probability model that has dispersion parameter that accommodates the mean-variance relationship of experiments with few replicates (n= 1-4). A quasi-NB model can employ shrinkage on the dispersion parameter that accommodates the stochastic and technical variability that is present within a small sample size of this experimental design and is more robust than a one parameter poisson. (Yu 2013)
The analysis in this paper uses a quasi-NB generalized linear model from the package “EdgeR” where:

The authors of edgeR, Mark Robinson, Gordon Smyth and David J. McCarthy made this modification to the NB model for the nuances of sequencing data with with low n by normalizing the different library sizes and employing quasi-likelihood with EB methods on the dispersion parameter. (Robinson 2010) The dispersion parameter estimate is found by maximizing the “linear combination of per-gene and common-dispersion conditional likelihoods”. (Lun 2015) This method accounts for dispersion gene-wise and across sample information via EB methods, a robust method for the mean-variance dependence that contains biological and technical sources of variation that is amplified by low n. This is performed by the edgeR function estimateDisp(). A technical source of variation is due to sample mRNA composition biases affects the amount of sequencing reads, as an upregulated gene and longer genes will outcompete downregulated genes for sequencing resources such as dNTP’s. This bias is accounted for within the m_ij P_gi parameter, as the library size per sample/condition is considered with the probability that a sequence read maps to a gene. (Yu 2013) This adjustment is seen within the edgeR function calcNormfactors().

#create shrinkage of dispersion and calculate norm factors
#create group factors for contrast
group <- factor(c(1,1,1,2,2,2))
#DGElist creates a list with grouping factors for sample
y <- DGEList(counts=fcRS_counts,group=group)
#this filters lowly expressed genes with may increase the FDR -
#debatable if the genes are important and requires post hoc analysis
keep <- filterByExpr(y)
#rewrites with genes expression counts > 10
y <- y[keep,,keep.lib.sizes=FALSE]
#trimmed mean of M values(TMM) normalization allows comparison of different sized libraries
#the libraries
y <- calcNormFactors(y)
#create design contrast for glm model
design <- model.matrix(~group)
#estimate dispersion using linear combination of gene and sample wise likelihood
y <- estimateDisp(y,design)

We can evaluate the transformed data with the biological coefficient of variation(BCV)( √(ϕ_g )), which is plotted against gene abundance. The BCV metric represents variation due to biological processes, eliminating technical variability due to the sequencing process. (Laio 2016) The BCV is plotted against gene abundance to evaulate the quality of the transformation. Transformations that don’t suit the data would display high BCV values, differentiated clusters, conversely, conversely, a behaved BCV plot with cell-line RNA sees BCV values range between 0.05 and 0.5, with a downward trend in increasing expression. (Laio 2016)

#Plot the Biological Coefficient of Variation vs log2CountperMillion Abundance
plotBCV(y)

The BCV vs. gene abundance plot of this dataset behaves according to the authors standards, as such, the shrinkage and normalization transformation is well-suited to the Fulvestrant/MCF-7 dataset. (Lun 2015)

The normalized, shrunken values and a contrast matrix are used for fit values with glmQLfit() and the output is as follows:

#Fit a quasi-NB model with design matrix
fit <- glmQLFit(y,design)

The model fit contains gene-wise fitted values, coefficients, deviance and normalization and dispersion parameter values.

The edgeR function plotQLdisp() plots the fitted values and the raw gene counts in a quarter root mean deviance vs. average log2 CPM abundance plot.

#Plot shrinkage of counts
plotQLDisp(fit)

These fitted mean count values are used in a one-way ANOVA and F-test to determine differentially expressed genes. This is done by generating f-statistics for each gene and calculating the corresponding p-value. A significance threshold of 0.05 is set and the large false discovery rate is accounted for by a post-hoc benjamini-hochberg correction. Since a high false-discovery rate is expected due to the biological and technical variation combined with a small sample size and whole genome sequencing. Ultimately, the post-hoc reduces the influence of noise and should direct the list of DGE’s towards transcripts influenced by pharmacodynamic effects of Fulvestrant.

#F-test of fit
qlf <- glmQLFTest(fit,coef=2)
#topTags extracts the most statistically DEGs
TopGENES <- topTags(qlf)
#extract table of DEGS from s3 object
TopTable <- TopGENES[["table"]]
#use library(annotate) to include gene symbols 
gene_symbols <- getSYMBOL(c(rownames), data = 'org.Hs.eg.db')
#bind symbols with top DEgenes
ToptableSym <- cbind(gene_symbols, TopTable)
#display in a table
reactable(ToptableSym)

The topgene() function generates the table above, this function derives a list of top differentially expressed genes by evaluating the metrics contained in the above table. The p-values are then used to color up-regulated and down-regulated genes within the mean-difference plot, plotMD(), that is included below:

#plot log CPM vs. log-fold change with p-value 
#color scale to show magnitude of differential
#gene expression
plotMD(qlf, main = "Fulvestrant v. Vehicle")

Discussion: The identification of DEGs within Fulvestrant-treated cells with Rsubread and EdgeR proved to be a robust method, as the majority of the top DEG list have published research regarding their roles in estrogen-mediated tumor growth. Research into the top DEG CXCL12 reveals its importance in estrogen signaling, as CXCL12 is a key mediator of the growth effects of estrogen in MCF-7 cell lines.(Boudot 2011) This result does need further investigation, as it is shown to be upregulated, likely due to a contrast design error. The 2011 publication “Differential estrogen-regulation of CXCL12 chemokine receptors, CXCR4 and CXCR7, contributes to the growth effect of estrogens in breast cancer cells” by Kerdivel Boudot states a reduction in basal cell growth with CXCL12 downregulation or inhibition, which is the effect of Fulvestrant. GREB1 is an estrogen-regulated tumor promoter that also shows an upregulation, further supporting the improper contrast design. (Hodgkinson 2018) PDZK1 is identified as a growth-factor in estogen-mediated breast cancer growth by Hogyoung Kim et al. in 2013. Another mediator of estrogen/ER communication is GATA4, however, it is typically associated with osteoblast transcription. (Miranda-Carboni GA 2011) AQP3 is identified as a factor in estrogen-induced metastasis in ER+ breast cancer. (Yi-Ting 2015) SIAH2 (Scortegagna 2020), TFF1 (Bauche 2011) were published for their roles in cancer as well.

The identification of these differentially expressed genes that are relevant to the estrogen-mediated growth transcription pathways gives us important and significant results relevant to our hypothesis. The treatment of Fulvestrant does affect the gene expression profile of the MCF-7 breast cancer cell line.

Citations: Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139-140. doi:10.1093/bioinformatics/btp616

Danni Yu, Wolfgang Huber, Olga Vitek, Shrinkage estimation of dispersion in Negative Binomial models for RNA-seq experiments with small sample size, Bioinformatics, Volume 29, Issue 10, 15 May 2013, Pages 1275–1282, https://doi.org/10.1093/bioinformatics/btt143

Adrian V. Lee, Steffi Oesterreich, Nancy E. Davidson, MCF-7 Cells—Changing the Course of Breast Cancer Research and Care for 45 Years, JNCI: Journal of the National Cancer Institute, Volume 107, Issue 7, July 2015, djv073, https://doi.org/10.1093/jnci/djv073

Fulvestrant Injection: MedlinePlus Drug Information. (n.d.). Medlineplus.Gov. Retrieved December 18, 2020, from https://medlineplus.gov/druginfo/meds/a607031.html

SRA in the Cloud. (2020). National Center for Biotechnology Information. https://www.ncbi.nlm.nih.gov/sra/docs/sra-cloud/

Liao Y, Smyth GK and Shi W (2019). The R package Rsubread is easier,faster, cheaper and better for alignment andquantification of RNA sequencing reads. Nucleic Acids Research 47(8), e47.

The new, improved human genome. (2020). EMBL’s European Bionformatics Institute. https://www.ebi.ac.uk/about/news/press-releases/ensembl-grch38

H. Wickham. ggplot2:Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.

Mistry, M. R. K. (2020, June 9). From raw sequence reads to count matrix. Introduction to DGE. https://hbctraining.github.io/DGE_workshop_salmon_online/lessons/01a_RNAseq_processing_workflow.html

Scortegagna, M., Hockemeyer, K., Dolgalev, I. et al. Siah2 control of T-regulatory cells limits anti-tumor immunity. Nat Commun 11, 99 (2020). https://doi.org/10.1038/s41467-019-13826-7

Buache, E., Etique, N., Alpy, F. et al. Deficiency in trefoil factor 1 (TFF1) increases tumorigenicity of human breast cancer cells and mammary tumor development in TFF1-knockout mice. Oncogene 30, 3261–3273 (2011). https://doi.org/10.1038/onc.2011.41

Boudot A, Kerdivel G, Habauzit D, Eeckhoute J, Le Dily F, Flouriot G, Samson M, Pakdel F. Differential estrogen-regulation of CXCL12 chemokine receptors, CXCR4 and CXCR7, contributes to the growth effect of estrogens in breast cancer cells. PLoS One. 2011;6(6):e20898. doi: 10.1371/journal.pone.0020898. Epub 2011 Jun 10. PMID: 21695171; PMCID: PMC3112227.

Miranda-Carboni GA, Guemes M, Bailey S, Anaya E, Corselli M, Peault B, Krum SA. GATA4 regulates estrogen receptor-alpha-mediated osteoblast transcription. Mol Endocrinol. 2011 Jul;25(7):1126-36. doi: 10.1210/me.2010-0463. Epub 2011 May 12. PMID: 21566084; PMCID: PMC3125091.

Hodgkinson, K., Forrest, L.A., Vuong, N. et al. GREB1 is an estrogen receptor-regulated tumour promoter that is frequently expressed in ovarian cancer. Oncogene 37, 5873–5886 (2018). https://doi.org/10.1038/s41388-018-0377-y

Kim H, Abd Elmageed ZY, Ju J, Naura AS, Abdel-Mageed AB, Varughese S, Paul D, Alahari S, Catling A, Kim JG, Boulares AH. PDZK1 is a novel factor in breast cancer that is indirectly regulated by estrogen through IGF-1R and promotes estrogen-mediated growth. Mol Med. 2013 Aug 28;19(1):253-62. doi: 10.2119/molmed.2011.00001. PMID: 23821363; PMCID: PMC3769530.

Huang, Y. T., Zhou, J., Shi, S., Xu, H. Y., Qu, F., Zhang, D., Chen, Y. D., Yang, J., Huang, H. F., & Sheng, J. Z. (2015). Identification of Estrogen Response Element in Aquaporin-3 Gene that Mediates Estrogen-induced Cell Migration and Invasion in Estrogen Receptor-positive Breast Cancer. Scientific reports, 5, 12484. https://doi.org/10.1038/srep12484