miRNA and isomiR Differential Expression Analysis in Breast Cancer
Iris Mestres Pascual
In this study, the differential expression analysis of the results
provided by the MIRFLOWZ pipeline is performed. The main goal
is to asses if there are any miRNA species differentially expressed
across normal and tumor tissues. The conclusions are found in the main
manuscript. Here, only the workflow with step-by-step explanations is
reported.
Data Preprocessing
Data Loading and Sample Selection
The MIRFLOWZ pipeline has been ran with a specific set of
samples. Therefore, the metadata for the differential expression
analysis must contain exactly the same ones. This information is
provided in two different files, one containing data per case
and the other one with per sample information.
The per case file, contains data from NCI’s Cancer Genome
Atlas Program (TCGA). Thus, after loading the data frame, only the cases
belonging to the breast cancer (BRCA) study are kept.
raw_case_metadata <- read_excel("case_metadata.xlsx",
sheet = 1,
col_types = NULL)
case_metadata <- raw_case_metadata %>%
select(2:34) %>%
filter(type == "BRCA")
The per sample metadata is already from the TCGA-BRCA study
so, from the data set, a particular set of columns are selected; the
file name (File.Name), the case and sample ID
(Case.ID and Sample.ID respectively), and the
sample’s tissue type (Sample.Type). In addition, the file
name will be formatted as to match the sample names in the counts table
and an extra field with the sample’s plate is added.
raw_sample_metadata <- read.table("./sample_metadata.csv",
sep = "\t",
header = TRUE)
sample_metadata <- raw_sample_metadata %>%
select(c(File.Name, Case.ID, Sample.ID, Sample.Type))
for (i in 1:nrow(sample_metadata)){
sample_metadata$File.Name[i] <- substr(raw_sample_metadata$File.Name[i], 6, 28)
sample_metadata$Plate[i] <- strsplit(sample_metadata$File.Name[i], "-")[[1]][5]
}
Following the sample selection criteria used in the main report,
female cases containing solely a samples coming from a normal tissue and
another one from a primary tumor with no missing data are kept. In
addition, any case with a new tumor event or has died without tumors it
is removed. Moreover, only infiltrating ductal carcinoma bellow
Stage IIB cases are maintained. Finally, the sample’s tissue
types are modified to be labeled as either Normal or
Tumor.
sample_metadata <- sample_metadata %>%
group_by(Case.ID) %>%
filter( n() == 2 ,
any(Sample.Type == "Solid Tissue Normal"),
!Sample.Type == "Metastatic") %>%
arrange(Case.ID)
sample_metadata$Sample.Type[sample_metadata$Sample.Type == "Primary Tumor"] <- "Tumor"
sample_metadata$Sample.Type[sample_metadata$Sample.Type == "Solid Tissue Normal"] <- "Normal"
case_metadata <- case_metadata %>%
filter(bcr_patient_barcode %in% sample_metadata$Case.ID,
!(vital_status == "Dead"& tumor_status == "TUMOR FREE"),
gender == "FEMALE",
!is.na(tumor_status),
is.na(new_tumor_event_type),
race != "[Not Available]",
histological_type == "Infiltrating Ductal Carcinoma",
!ajcc_pathologic_tumor_stage %in% c("Stage IIIB",
"Stage IIIC",
"Stage IV")) %>%
select(bcr_patient_barcode,
age_at_initial_pathologic_diagnosis,
race,
ajcc_pathologic_tumor_stage,
vital_status,
tumor_status,
menopause_status,
initial_pathologic_dx_year,
birth_days_to,
death_days_to)
colnames(case_metadata)[1] <- "Case.ID"
sample_metadata <- sample_metadata %>%
filter(Case.ID %in% case_metadata$Case.ID) %>%
select(File.Name, Case.ID, Sample.Type, Plate) %>%
arrange(File.Name)
The filtered data sets are combined into a single one. The merge
leads to have samples coming from normal tissues to have a cancer stage
assigned thus, they are labeled as having no stage.
metadata <- sample_metadata %>%
inner_join(case_metadata, by = "Case.ID")
metadata <- column_to_rownames(metadata, var = "File.Name")
metadata$ajcc_pathologic_tumor_stage[metadata$Sample.Type == "Normal"] <- "None"
The sample’s count tables are joined together. There are miRNA
species present in one or several samples that are not in others. With
the merging, these counts are changed from NA to 0.
Furthermore, the sample file names are used as column names and, given
that the file names in the metadata and the counts table contain
different separator characters, the column names in the counts table are
changed to match the metadata. Finally, all the loaded information is
stored in an edgeR (Robinson, McCarthy, and
Smyth 2010).
options(scipen=999)
fnames <- list.files("./mirna_samples", full.names = T)
mirna_counts <- read.csv(fnames[1],sep = "\t", header = TRUE)
for (table in fnames[-1]){
table <- read.csv(table, header = TRUE, sep = "\t")
mirna_counts <- mirna_counts %>% full_join(table)
}
mirna_counts[is.na(mirna_counts)] <- 0
rownames(mirna_counts) <- mirna_counts[,1]
mirna_counts <- mirna_counts[,-1]
colnames(mirna_counts) <-gsub("\\.", "-", colnames(mirna_counts))
dge <- DGEList(counts = mirna_counts, group = as.factor(metadata$Sample.Type))
options(digits = 2)
Exploratory Data Analysis
In this exploratory data analysis the nature of the data is analyzed
from different perspectives as to identify trends, relationships and
possible anomalies such as outliers or batch effects.
Data Distribution
The first visualization focuses on the miRNA species counts per
sample as a way to get a glance of the count’s distribution. In the
following plot, each horizontal line represents a miRNA species. For an
easier interpretation, samples are colored.

The most notable aspect of the plot is the presence of low counts
between the different samples. Due to the MIRFLOWZ unambiguous
notation, there are very specific isomiRs present in low frequency in
just a few samples. Hence, this trend was expected. On the other hand, a
second set of miRNA species has a greater presence among the samples
with up to 1000000 counts and only four samples stand out, being the
higher count from the sample E2-A1LH-01A-11R-A14C-13 in the
hsa-miR-21-5p canonical miRNA.
From a statistical point of view, genes with consistently low counts
are very unlikely be assessed as significantly differentially expressed
because low counts do not provide enough statistical evidence for a
reliable judgement to be made. Such genes can therefore be removed from
the analysis without any loss of information.
In order to asses which miRNAs will be kept, the edgeR
function filterByExpr is used. This function bases the
filtering on count-per-million (CPM) values so as to avoid favoring
genes that are expressed in larger libraries over those expressed in
smaller libraries following a similar approach as the one used by Chen, Lun, and Smyth (2016).
mirna2keep <- filterByExpr(dge)
dge <- dge[mirna2keep, , keep.lib.sizes=F]
Furthermore, to eliminate composition biases between libraries, the
trimmed mean of M-values (TMM) variant proposed by Robinson and Oshlack (2010),
“TMM with singleton pairing” ( TMMwsp ), is applied to the data
set with the edgeR function calcNormFactors.
This variant is preferred over the normal TMM as it is intended to
perform better for data with a high proportion of zeros. As a result, an
extra field with the library scaling factors is added to the
DGEList object.
dge <- calcNormFactors(dge, method = "TMMwsp")
Principal Component Analysis
In order to accomplish a greater understanding of the data, a
Principal Component Analysis (PCA) is performed using the R
function prcomp. The PCA must be conducted over the
normalized counts which are obtained with the edgeR
function cpm. In addition, the normalized counts are
returned as their log2 value.
The main goal of this analysis is to assess whether the different
miRNA species are differentially expressed across tissues, each sample
is colored according to it. Mind that as the final plot will be
interactive, all the possible relevant information about each sample
will be displayed when hovering over it.
log_norm_mirna <- cpm(dge, log = T, normalized.lib.sizes = )
PCA <- prcomp(t(log_norm_mirna), scale. = T)
In this second visualization, three outliers are found: samples
BH-A0BZ-01A-31R-A12O-13,
BH-A1FN-01A-11R-A13P-13 and
E2-A1LH-01A-11R-A14C-13. Even if the former is the only one
that has been already spotted as such, all three are removed from the
data set. Prior to the PCA, the filtering of low expressed miRNAs and
the counts normalization are performed again in order to capture the
lacking of these miRNA species.
outliers <- c("BH-A0BZ-01A-31R-A12O-13",
"BH-A1FN-01A-11R-A13P-13",
"E2-A1LH-01A-11R-A14C-13")
dge <- dge[, !rownames(dge$samples) %in% outliers]
metadata <- metadata[!rownames(metadata) %in% outliers, ]
mirna2keep <- filterByExpr(dge)
dge <- dge[mirna2keep, , keep.lib.sizes = F]
dge <- calcNormFactors(dge, method = "TMMwsp")
log_norm_mirna <- cpm(dge, log = T, normalized.lib.sizes = T)
PCA <- prcomp(t(log_norm_mirna), scale. = T)
There is a clear tendency in the 4 different clusters. Some
parameters such as the sample’s vital status had been used in order to
assess if there is any relation between them and the clustering. At the
end, the sample’s origin plate and whether is present or not in both
kind of tissues is the cause of it.
Plates A090 and A057 exhibit this effect
the most followed by A085 (with a single sample) and
A010. From these four plates, plates A057,
A085 and A010 are the ones that were only used
for tumor samples, while plate A090 was only used for
normal tissue samples. It is important to mention that the small sample
set used in this study, could be the reason behind this effect to be so
large.
Differential Expression Analysis
It is of interest to see which miRNA species are expressed at
different levels across tissues. With this goal in mind, a linear model
is going to be fitted, to perform the differential expression
analysis.
Design Matrix and Dispersion Estimation
The first step towards the objective is to build a design matrix. The
design matrix links each tissue (columns) to the samples that belong to
it (rows). Even if a slightly batch effect has been observed due to the
plate of origin, the design in this study is considered to a one-way
design under the assumption that the difference observed is not due to
the plate of origin to be a potential confounder but an effect of the
sample size.
design_mat <- model.matrix(~0 + dge$samples$group)
colnames(design_mat) <- levels(dge$samples$group)
Normal Tumor
1 0 1
2 1 0
3 0 1
4 1 0
5 0 1
6 1 0
Due the nature of the data to be analyzed, there could be biological
variability that may turn the observed counts into noise. Estimating
this variability within the different tissues is going to help on
modeling the statistical uncertainty associated with the observed counts
and on making accurate inferences about differential expression between
them.
The dispersion estimates, are computed with the edgeR
function estimateDisp. This function uses the negative
binomial distribution to model the read counts per gene per sample. With
this function three different dispersion estimates are computed. The
first one is an empirical Bayes moderated dispersion for each individual
miRNA species (tagwise). The second one, the common dispersion,
is a global estimate averaged across all species. Finally, a predicted
dispersion for each miRNA is obtained from its abundance
(trended).
In addition, to guard the tagwise estimates from outlier
miRNAs with remarkably large or small individual dispersions, the
robust = TRUE option is used (Phipson et al.
2016). The dispersion estimates are visualized with the
edgeR function plotBCV. The biological
coefficient of variation in the y-axis, is the square root
of the negative binomial dispersion. In the x-axis the is
the mean of the log normalized counts.
dge <- estimateDisp(dge, design_mat, robust=TRUE)

A characteristic of RNA-seq studies, is the tendency for higher
negative binomial dispersions in genes with very low counts whereas the
dispersion trend decreases asymptoticaly to a constant value for genes
with higher counts (Chen, Lun, and Smyth
2016). Once more, due to the isomiRs naming strategy, the
data contains a lot of low counts even after removing the less
significant miRNA species.

In order to further account for technical variability. the negative
binomial model can be extended with quasi-likelihood methods (Lun, Chen,
and Smyth 2016) (Lund et al.
2012) . In this approach, the negative binomial dispersion
trend describes the biological variability across all miRNAs. This
overall trend is used to select the miRNA-specific variability that lay
above and below it. Individual dispersions (tagwise) are not
used.
Using the edgeR function glmQLFit several
empirical Bayes statistics such as the quasi-likelihood dispersion
trend, the squeezed quasi-likelihood dispersion estimates or the prior
degrees of freedom, as well as the estimated values of the generalized
linear model coefficients for each gene are computed. Incorporating
robust handling of zero counts, especially when the fitted values of
these counts yield useless residual degrees of freedom for the
quasi-likelihood dispersion estimation (Phipson et al.
2016), is a focal point of this function. To obtain the
squeezed quasi-likelihood dispersion estimates, the function tightens
the raw quasi-likelihood dispersion estimates to adhere to a global
trend. This serves to mitigate the uncertainty inherent in the
estimates, thereby enhancing the statistical power of subsequent tests.
Furthermore, the compression magnitude during the estimation is
contingent upon the value of the prior degrees of freedom. Substantial
prior degrees of freedom estimates indicate quasi-likelihood dispersions
with lower variability across miRNA species. In such scenario, it is
advisable to engage in robust empirical Bayes moderation. Conversely,
smaller prior degrees of freedom estimates imply a higher degree of
variability. In these cases, a milder moderation approach is more
appropriate. This modulation, ensures that the empirical Bayes procedure
adapts to the underlying characteristics of the data, yielding more
accurate and meaningful results.
Setting robust = TRUE. as to facilitate the generation
of miRNA-specific prior degrees of freedom estimates This approach
minimizes the likelihood of encountering false positives from miRNA
species with exceedingly high or low raw dispersions. Therefore, it
ensures that the analysis remains resilient against data aberrations
while amplifying its ability to identify significant differential
expression across the majority of miRNAs.
The new dispersion estimates are visualized with the
edgeR function plotQLDisp.
fit <- glmQLFit(dge, design_mat, robust=TRUE)

Differential Expression Analysis (p-value)
As a means to retrieve the upcoming results as a comparison between
normal and tumor tissues, a contrast matrix is build using the
limma (Ritchie et al.
2015) function makeContrasts.
contrast_mat <- makeContrasts(Normal - Tumor, levels = design_mat)
All the subsequent results will now be relative to normal tissues.
That is, a positive log2-fold-change (logFC) will designate
an over-expressed miRNA species in normal tissues. Conversely, a
negative logFC will denote a more highly expressed miRNA species in
tumor tissues.
To account for the uncertainty in dispersion estimation in the error
rate control, rather than the usual likelihood ratio test, the
quasi-likelihood F-tests is performed with the edgeR
function glmQLFTest. This approach leads to
p-values greater than or equal to the ones that the likelihood
ratio tests would had provided using the same negative binomial
dispersions.
Finally, the obtained p-values are submitted to a multiple
testing correction as to regulate the false discovery rate (FDR) using
the Benjamin-Hochberg method. The edgeR function
topTags provides a table with all the differentially
expressed miRNA species that have an adjusted p-value smaller
than or equal to 0.01. The decision of setting the significance level to
0.01 instead of the usual 0.05, is made to account for the small size of
the data set and the small batch effect previously seen. This stringent
significance level aims to reduce the likelihood of false positives that
the mentioned facts could originate. Note that the table with the
results is not shown as it is automatically added in the next plot.
res <- glmQLFTest(fit, contrast = contrast_mat)
top <- topTags(res, n = Inf, p.value = 0.01)
Using the Glimma function glimmaVolcano the
results of the previous analysis are displayed in an interactive
volcano-plot.
Differential Expression Analysis (fold-change)
The previous differential expression analysis, was based on
statistical significance. 2131 differentially expressed miRNA species
between the normal and tumor tissues have been found among the 3291
present in the data set. Having this huge amount of differentially
expressed miRNA species, incites narrowing down of the results to keep
only the ones that are more biologically meaningful.
This new approach focuses on spotting whether the differential
expression logFCs are significantly broad than log2(1.5) by
using the edgeR function glmTreat. This
method, prioritizes genes that are likely to have a higher biological
significance with larger changes. It is worth mentioning that
glmTreat does not set a fold change cutoff but evaluates
both, the magnitude of change of expressions values and its variability.
Thus, not only the logFC estimated value must be above the threshold,
but its whole confidence interval must.
res2 <- glmTreat(fit, contrast = contrast_mat, lfc = log2(1.5))
top2 <- topTags(res2, n = Inf, p.value = 0.01)
The final amount of differentially expressed miRNA species is 1440.
Once more, using the Glimma function
glimmaVolcano the results of the analysis are visualized in
an interactive volcano-plot.
Bibliography
Chen, Yunshun, Aaron T. L. Lun, and Gordon K. Smyth. 2016.
“From
Reads to Genes to Pathways: Differential Expression Analysis of
RNA-Seq Experiments Using Rsubread and the
edgeR Quasi-Likelihood Pipeline,” no.
5:1438 (August).
https://doi.org/10.12688/f1000research.8987.2.
Lun, Aaron T. L., Yunshun Chen, and Gordon K. Smyth. 2016.
“It’s
DE-licious: A Recipe for
Differential Expression Analyses of RNA-seq Experiments Using Quasi-Likelihood Methods
in edgeR.” In
Statistical
Genomics: Methods and
Protocols, edited by Ewy Mathé and Sean Davis,
391–416. Methods in
Molecular Biology.
New York,
NY:
Springer.
https://doi.org/10.1007/978-1-4939-3578-9_19.
Lund, Steven P., Dan Nettleton, Davis J. McCarthy, and Gordon K. Smyth.
2012.
“Detecting Differential Expression in RNA-sequence Data Using Quasi-Likelihood with
Shrunken Dispersion Estimates.” Statistical Applications in
Genetics and Molecular Biology 11 (5):
/j/sagmb.2012.11.issue-5/1544-6115.1826/1544-6115.1826.xml.
https://doi.org/10.1515/1544-6115.1826.
Phipson, Belinda, Stanley Lee, Ian J. Majewski, Warren S. Alexander, and
Gordon K. Smyth. 2016.
“Robust Hyperparameter Estimation Protects
Against Hypervariable Genes and Improves Power to Detect Differential
Expression.” The Annals of Applied Statistics 10 (2):
946–63.
https://doi.org/10.1214/16-AOAS920.
Ritchie, Matthew E., Belinda Phipson, Di Wu, Yifang Hu, Charity W. Law,
Wei Shi, and Gordon K. Smyth. 2015.
“Limma Powers Differential
Expression Analyses for RNA-sequencing and
Microarray Studies.” Nucleic Acids Research 43 (7): e47.
https://doi.org/10.1093/nar/gkv007.
Robinson, Mark D., Davis J. McCarthy, and Gordon K. Smyth. 2010.
“edgeR: A Bioconductor
Package for Differential Expression Analysis of Digital Gene Expression
Data.” Bioinformatics 26 (1): 139–40.
https://doi.org/10.1093/bioinformatics/btp616.
Robinson, Mark D., and Alicia Oshlack. 2010.
“A Scaling
Normalization Method for Differential Expression Analysis of RNA-seq Data.” Genome Biology 11
(3): R25.
https://doi.org/10.1186/gb-2010-11-3-r25.
---
bibliography: ../fdp.bib
link-citations: true
output:
  html_document:
    css: bootstrap.css
    highlight: haddock
    code_download: TRUE
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, comment = NA, warning = FALSE, message = FALSE)

library(dplyr)
library(edgeR)
library(factoextra)
library(ggplot2)
library(Glimma)
library(kableExtra)
library(knitr)
library(pheatmap)
library(plotly)
library(RColorBrewer)
library(readxl)
library(reshape2)
library(statmod)
library(tidyverse)

# Changing default table's style
kable <- function(data, ...){
        knitr::kable(data, 
                     format = "html",
                     digits = 3,
                     format.args = list(scientific = TRUE),
                     align = 'c',
                     linesep = '') %>%
        kable_paper() %>%
        scroll_box(width = "600px", height = "400px")
}

knit_print.data.frame <- function(x, ...){
        res <- paste(c('', '', kable(x)),
                     collapse = "\n")
        asis_output(res)
}

registerS3method("knit_print", 
                 "data.frame",
                 knit_print.data.frame)

# Print nothing if NA within a table
options(knit.kable.table.NA = '')

theme_set(theme_classic())
```

<br>

<h1>miRNA and isomiR Differential Expression Analysis in Breast Cancer</h1>

<h3 align="right">Iris Mestres Pascual</h3>

<hr>

<br>

In this study, the differential expression analysis of the results provided by
the *MIRFLOWZ* pipeline is performed. The main goal is to asses if there are any
miRNA species differentially expressed across normal and tumor tissues. The
conclusions are found in the main manuscript. Here, only the workflow with
step-by-step explanations is reported.

<br>

<h1>Data Preprocessing</h1>
<hr class="half-width">

<br>

<h2 align="right">Data Loading and Sample Selection</h2>

The *MIRFLOWZ* pipeline has been ran with a specific set of samples. Therefore, 
the metadata for the differential expression analysis must contain exactly the 
same ones. This information is provided in two different files, one containing 
data *per case* and the other one with *per sample* information.

The *per case* file, contains data from NCI's Cancer Genome Atlas Program 
(TCGA). Thus, after loading the data frame, only the cases belonging to the 
breast cancer (BRCA) study are kept.

```{r case_metadata, warning=FALSE}
raw_case_metadata <- read_excel("case_metadata.xlsx",
                           sheet = 1,
                           col_types = NULL)

case_metadata <- raw_case_metadata %>% 
  select(2:34) %>% 
  filter(type == "BRCA")
```

<br>

The *per sample* metadata is already from the TCGA-BRCA study so, from the data
set, a particular set of columns are selected; the file name (`File.Name`), the
case and sample ID (`Case.ID` and `Sample.ID` respectively), and the sample's
tissue type (`Sample.Type`). In addition, the file name will be formatted as to
match the sample names in the counts table and an extra field with the sample's
plate is added.

```{r sample_metadata}
raw_sample_metadata <- read.table("./sample_metadata.csv",
                                  sep = "\t",
                                  header = TRUE)

sample_metadata <- raw_sample_metadata %>% 
  select(c(File.Name, Case.ID, Sample.ID, Sample.Type))

for (i in 1:nrow(sample_metadata)){
  sample_metadata$File.Name[i] <- substr(raw_sample_metadata$File.Name[i], 6, 28)
  sample_metadata$Plate[i] <- strsplit(sample_metadata$File.Name[i], "-")[[1]][5]
}
```

<br>

Following the sample selection criteria used in the main report, female cases
containing solely a samples coming from a normal tissue and another one from a
primary tumor with no missing data are kept. In addition, any case with a new
tumor event or has died without tumors it is removed. Moreover, only
infiltrating ductal carcinoma bellow *Stage IIB* cases are maintained. Finally,
the sample's tissue types are modified to be labeled as either `Normal` or
`Tumor`.

```{r sample_selection}
sample_metadata <- sample_metadata %>% 
  group_by(Case.ID) %>% 
  filter( n() == 2 ,
          any(Sample.Type == "Solid Tissue Normal"), 
          !Sample.Type == "Metastatic") %>%
  arrange(Case.ID)

sample_metadata$Sample.Type[sample_metadata$Sample.Type == "Primary Tumor"] <- "Tumor"
sample_metadata$Sample.Type[sample_metadata$Sample.Type == "Solid Tissue Normal"] <- "Normal"

case_metadata <- case_metadata %>%
  filter(bcr_patient_barcode %in% sample_metadata$Case.ID,
         !(vital_status == "Dead"& tumor_status == "TUMOR FREE"),
         gender == "FEMALE",
         !is.na(tumor_status),
         is.na(new_tumor_event_type),
         race != "[Not Available]",
         histological_type == "Infiltrating Ductal Carcinoma",
         !ajcc_pathologic_tumor_stage %in% c("Stage IIIB", 
                                             "Stage IIIC", 
                                             "Stage IV")) %>%
  select(bcr_patient_barcode, 
         age_at_initial_pathologic_diagnosis,
         race, 
         ajcc_pathologic_tumor_stage,
         vital_status,
         tumor_status, 
         menopause_status, 
         initial_pathologic_dx_year, 
         birth_days_to, 
         death_days_to) 
colnames(case_metadata)[1] <- "Case.ID"

sample_metadata <- sample_metadata %>%
  filter(Case.ID %in% case_metadata$Case.ID) %>%
  select(File.Name, Case.ID, Sample.Type, Plate) %>%
  arrange(File.Name)
```

<br>

The filtered data sets are combined into a single one. The merge leads to have
samples coming from normal tissues to have a cancer stage assigned thus, they
are labeled as having no stage.

```{r merge_metadata}
metadata <- sample_metadata %>% 
  inner_join(case_metadata, by = "Case.ID")

metadata <- column_to_rownames(metadata, var = "File.Name")

metadata$ajcc_pathologic_tumor_stage[metadata$Sample.Type == "Normal"] <- "None"
```

<br>

The sample's count tables are joined together. There are miRNA species present
in one or several samples that are not in others. With the merging, these counts
are changed from `NA` to 0. Furthermore, the sample file names are used as
column names and, given that the file names in the metadata and the counts table
contain different separator characters, the column names in the counts table are
changed to match the metadata. Finally, all the loaded information is stored in
an `edgeR` [@robinsonEdgeRBioconductorPackage2010].

```{r edgeR_mir-tables}
options(scipen=999)

fnames <- list.files("./mirna_samples", full.names = T)
mirna_counts <-  read.csv(fnames[1],sep = "\t", header = TRUE)

for (table in fnames[-1]){
  table <- read.csv(table, header = TRUE, sep = "\t")
  mirna_counts <- mirna_counts %>% full_join(table)
}

mirna_counts[is.na(mirna_counts)] <- 0

rownames(mirna_counts) <- mirna_counts[,1]
mirna_counts <- mirna_counts[,-1]

colnames(mirna_counts) <-gsub("\\.", "-", colnames(mirna_counts))

dge <- DGEList(counts = mirna_counts, group = as.factor(metadata$Sample.Type))
options(digits = 2)
```

<br>

<h2 align="right">Exploratory Data Analysis</h2>

In this exploratory data analysis the nature of the data is analyzed from
different perspectives as to identify trends, relationships and possible
anomalies such as outliers or batch effects.

<h3 align="right">Data Distribution</h3>

The first visualization focuses on the miRNA species counts per sample as a
way to get a glance of the count's distribution. In the following plot, each
horizontal line represents a miRNA species. For an easier interpretation,
samples are colored.

```{r countpersample, echo=F, fig.align = 'center'}
mirna_melt <- melt(as.matrix(dge$counts))

t <- ggplot(mirna_melt, aes(x = Var2, 
                            y = value, 
                            group = Var1, 
                            color = Var2)) +
  geom_line(stat = "identity") +
  labs( title = "miRNA Species Counts per Sample", 
        x = "Sample", y = "Counts") +
  theme(legend.position = "none", 
        axis.text.x = element_blank())
t
```

<br>

The most notable aspect of the plot is the presence of low counts between the
different samples. Due to the *MIRFLOWZ* unambiguous notation, there are very
specific isomiRs present in low frequency in just a few samples. Hence, this
trend was expected. On the other hand, a second set of miRNA species has a
greater presence among the samples with up to 1000000 counts and only four
samples stand out, being the higher count from the sample
`E2-A1LH-01A-11R-A14C-13` in the `hsa-miR-21-5p` canonical miRNA.

From a statistical point of view, genes with consistently low counts are very
unlikely be assessed as significantly differentially expressed because low
counts do not provide enough statistical evidence for a reliable judgement to be
made. Such genes can therefore be removed from the analysis without any loss of
information.

In order to asses which miRNAs will be kept, the `edgeR` function `filterByExpr`
is used. This function bases the filtering on count-per-million (CPM) values so
as to avoid favoring genes that are expressed in larger libraries over those
expressed in smaller libraries following a similar approach as the one used by
@chenReadsGenesPathways2016.

```{r mirna2keep}
mirna2keep <- filterByExpr(dge)
dge <- dge[mirna2keep, , keep.lib.sizes=F]
```

<br>

Furthermore, to eliminate composition biases between libraries, the trimmed mean
of M-values (TMM) variant proposed by @robinsonScalingNormalizationMethod2010a,
"TMM with singleton pairing" ( *TMMwsp* ), is applied to the data set with the
`edgeR` function `calcNormFactors`. This variant is preferred over the normal
TMM as it is intended to perform better for data with a high proportion of
zeros. As a result, an extra field with the library scaling factors is added to
the `DGEList` object.

```{r normalization}
dge <- calcNormFactors(dge, method = "TMMwsp")
```

<br>

<h3 align="right">Principal Component Analysis</h3>

In order to accomplish a greater understanding of the data, a Principal
Component Analysis (PCA) is performed using the `R` function `prcomp`. The PCA
must be conducted over the normalized counts which are obtained with the `edgeR`
function `cpm`. In addition, the normalized counts are returned as their log~2~
value.

The main goal of this analysis is to assess whether the different miRNA species
are differentially expressed across tissues, each sample is colored according to
it. Mind that as the final plot will be interactive, all the possible relevant
information about each sample will be displayed when hovering over it.

```{r tissue PCA}
log_norm_mirna <- cpm(dge, log = T, normalized.lib.sizes = )
PCA <- prcomp(t(log_norm_mirna), scale. = T)
```

```{r plot_tissue-PCA, echo=F, fig.align = 'center'}
components <- data.frame(PCA$x)

Case <- metadata$Case.ID
Sample <- rownames(components)
Stage <- metadata$ajcc_pathologic_tumor_stage
Tissue <- metadata$Sample.Type
"Vital status" <- metadata$vital_status

gg <- ggplot(components,aes(x = PC1, y= PC2, 
                            case = Case,
                            sample = Sample,
                            shape = Tissue, 
                            color = Tissue, 
                            stage = Stage, 
                            vital = `Vital status`)) +
    geom_point(alpha = 0.8, size = 4) +
  geom_hline(yintercept=0, linetype="dashed", alpha = 0.2) +
  geom_vline(xintercept=0, linetype="dashed", alpha = 0.2) +
  labs(title = "miRNA Expression Across Samples PCA", 
       x = "PC1 (23.01%)", 
       y = "PC2 (9.35%)") +
  scale_color_manual(values=c("#40E0D0","#8A2BE2")) +
  scale_shape_manual(values = c(1, 5))

ggplotly(gg)
```

<br>

In this second visualization, three outliers are found: samples
`BH-A0BZ-01A-31R-A12O-13`, `BH-A1FN-01A-11R-A13P-13` and
`E2-A1LH-01A-11R-A14C-13`. Even if the former is the only one that has been
already spotted as such, all three are removed from the data set. Prior to the
PCA, the filtering of low expressed miRNAs and the counts normalization are
performed again in order to capture the lacking of these miRNA species.

```{r removeOutliers}
outliers <-  c("BH-A0BZ-01A-31R-A12O-13",
               "BH-A1FN-01A-11R-A13P-13",
               "E2-A1LH-01A-11R-A14C-13")

dge <- dge[, !rownames(dge$samples) %in% outliers]
metadata <- metadata[!rownames(metadata) %in% outliers, ]

mirna2keep <- filterByExpr(dge)
dge <- dge[mirna2keep, , keep.lib.sizes = F]

dge <- calcNormFactors(dge, method = "TMMwsp")
log_norm_mirna <- cpm(dge, log = T, normalized.lib.sizes = T)

PCA <- prcomp(t(log_norm_mirna), scale. = T)
```

```{r plot_removeOutliers, echo=F, fig.align = 'center'}
components <- data.frame(PCA$x)

Case <- metadata$Case.ID
Sample <- rownames(components)
Stage <- metadata$ajcc_pathologic_tumor_stage
Tissue <- metadata$Sample.Type
"Vital status" <- metadata$vital_status

gg <- ggplot(components,aes(x = PC1, y= PC2, 
                            case = Case,
                            sample = Sample,
                            shape = Tissue, 
                            color = Tissue, 
                            stage = Stage, 
                            vital = `Vital status`)) +
    geom_point(alpha = 0.8, size = 4) +
  geom_hline(yintercept=0, linetype="dashed", alpha = 0.2) +
  geom_vline(xintercept=0, linetype="dashed", alpha = 0.2) +
  labs(title = "miRNA Expression Across Samples PCA\nwithout outliers", 
       x = "PC1 (27.37%)", 
       y = "PC2 (14.02%)") +
  scale_color_manual(values=c("#40E0D0","#8A2BE2")) +
  scale_shape_manual(values = c(1, 5))

ggplotly(gg)
```

<br>

There is a clear tendency in the 4 different clusters. Some parameters such as
the sample's vital status had been used in order to assess if there is any
relation between them and the clustering. At the end, the sample's origin plate
and whether is present or not in both kind of tissues is the cause of it.

```{r plot_plate-PCA, echo=F, fig.align = 'center'}
Plate <- metadata$Plate

plates_col = c( "#4B0082", "#00BFFF", "#FF0000", "#EB22BE", "#7CFC00",
                "#FF1493", "#8A2BE2", "#228B22", "#C71585", "#FF8C00",
                "#9900CC", "#A52A2A", "#00FFFF", "#FFD700", "#00FA9A")

gg <- ggplot(components,aes(x = PC1, y= PC2, 
                            case = Case,
                            sample = Sample,
                            stage = Stage, 
                            color = Plate,
                            shape = Tissue,
                            vital = `Vital status`)) +
    geom_point(alpha = 0.8, size = 4) +
  geom_hline(yintercept=0, linetype="dashed", alpha = 0.2) +
  geom_vline(xintercept=0, linetype="dashed", alpha = 0.2) +
  labs(title = "miRNA Expression Across Samples per Plate PCA", 
       x = "PC1 (27.37%)", 
       y = "PC2 (14.02%)") +
  scale_color_manual(values = plates_col) +
  scale_shape_manual(values = c(1, 5)) 

ggplotly(gg)
```

<br>

Plates `A090` and `A057` exhibit this effect the most followed by `A085` (with a
single sample) and `A010`. From these four plates, plates `A057`, `A085` and 
`A010` are the ones that were only used for tumor samples, while plate `A090`
was only used for normal tissue samples. It is important to mention that the
small sample set used in this study, could be the reason behind this effect to
be so large.

<br>

<h1>Differential Expression Analysis</h1>
<hr class="half-width2">

It is of interest to see which miRNA species are expressed at different levels
across tissues. With this goal in mind, a linear model is going to be fitted, to
perform the differential expression analysis.

<br>

<h2 align="right">Design Matrix and Dispersion Estimation</h2>

The first step towards the objective is to build a design matrix. The design
matrix links each tissue (columns) to the samples that belong to it (rows). Even
if a slightly batch effect has been observed due to the plate of origin, the
design in this study is considered to a one-way design under the assumption that
the difference observed is not due to the plate of origin to be a potential
confounder but an effect of the sample size.

```{r design mat.v2}
design_mat <- model.matrix(~0 + dge$samples$group)
colnames(design_mat) <- levels(dge$samples$group)
```

```{r view_design-mat, echo=F}
head(design_mat)
```

<br>

Due the nature of the data to be analyzed, there could be biological variability
that may turn the observed counts into noise. Estimating this variability within
the different tissues is going to help on modeling the statistical uncertainty
associated with the observed counts and on making accurate inferences about
differential expression between them.

The dispersion estimates, are computed with the `edgeR` function `estimateDisp`.
This function uses the negative binomial distribution to model the read counts
per gene per sample. With this function three different dispersion estimates are
computed. The first one is an empirical Bayes moderated dispersion for each
individual miRNA species (*tagwise*). The second one, the common dispersion, is
a global estimate averaged across all species. Finally, a predicted dispersion
for each miRNA is obtained from its abundance (*trended*).

In addition, to guard the *tagwise* estimates from outlier miRNAs with
remarkably large or small individual dispersions, the `robust = TRUE` option is
used [@phipsonRobustHyperparameterEstimation2016]. The dispersion estimates are
visualized with the `edgeR` function `plotBCV`. The biological coefficient of
variation in the `y-axis`, is the square root of the negative binomial
dispersion. In the `x-axis` the is the mean of the log normalized counts.

```{r disp.estimat}
dge <- estimateDisp(dge, design_mat, robust=TRUE)
```

```{r plot_dis.estimat, echo=F, fig.align = 'center'}
plotBCV(dge, main = "miRNA Counts Dispersion Estimates", 
        frame.plot=FALSE , 
        col.trend = "#8A2BE2", col.common = "#00BFFF")
box(bty="l")
axis(2)
axis(1)
```

<br>

A characteristic of RNA-seq studies, is the tendency for higher negative
binomial dispersions in genes with very low counts whereas the dispersion trend
decreases asymptoticaly to a constant value for genes with higher counts
[@chenReadsGenesPathways2016]. Once more, due to the isomiRs naming strategy,
the data contains a lot of low counts even after removing the less significant
miRNA species.

```{r plot_counts-dist, echo=F, fig.align = 'center'}
mirna_melt <- melt(log_norm_mirna)

ggplot(mirna_melt, aes(x = value)) +
  geom_histogram(binwidth = 0.5, 
                 color = "#FFFFFF", 
                 fill = "#9900CC",
                 alpha = 0.7) +
  labs( title = "Normalized log Counts Distribution", 
        x = "Counts", y = "miRNA species") +
  theme(legend.position = "none")
```

<br>

In order to further account for technical variability. the negative binomial
model can be extended with quasi-likelihood methods [@lunItDEliciousRecipe2016a]
[@lundDetectingDifferentialExpression2012] . In this approach, the negative
binomial dispersion trend describes the biological variability across all
miRNAs. This overall trend is used to select the miRNA-specific variability that
lay above and below it. Individual dispersions (*tagwise*) are not used.

Using the `edgeR` function `glmQLFit` several empirical Bayes statistics such as
the quasi-likelihood dispersion trend, the squeezed quasi-likelihood dispersion
estimates or the prior degrees of freedom, as well as the estimated values of
the generalized linear model coefficients for each gene are computed.
Incorporating robust handling of zero counts, especially when the fitted values
of these counts yield useless residual degrees of freedom for the
quasi-likelihood dispersion estimation 
[@phipsonRobustHyperparameterEstimation2016], is a focal point of this function.
To obtain the squeezed quasi-likelihood dispersion estimates, the function
tightens the raw quasi-likelihood dispersion estimates to adhere to a global
trend. This serves to mitigate the uncertainty inherent in the estimates,
thereby enhancing the statistical power of subsequent tests. Furthermore, the
compression magnitude during the estimation is contingent upon the value of the
prior degrees of freedom. Substantial prior degrees of freedom estimates
indicate quasi-likelihood dispersions with lower variability across miRNA
species. In such scenario, it is advisable to engage in robust empirical Bayes
moderation. Conversely, smaller prior degrees of freedom estimates imply a 
higher degree of variability. In these cases, a milder moderation approach is 
more appropriate. This modulation, ensures that the empirical Bayes procedure
adapts to the underlying characteristics of the data, yielding more accurate and
meaningful results.

Setting `robust = TRUE`. as to facilitate the generation of miRNA-specific prior
degrees of freedom estimates This approach minimizes the likelihood of 
encountering false positives from miRNA species with exceedingly high or low raw
dispersions. Therefore, it ensures that the analysis remains resilient against 
data aberrations while amplifying its ability to identify significant 
differential expression across the majority of miRNAs.

The new dispersion estimates are visualized with the `edgeR` function
`plotQLDisp`.

```{r quasi-likeliFit}
fit <- glmQLFit(dge, design_mat, robust=TRUE)
```

```{r plot_quasi-likeliFit, echo=F, fig.align = 'center'}
plotQLDisp(fit,  main = "miRNA Counts Quasi-Likelihood Dispersion Estimates", 
           frame.plot=FALSE , 
           col.trend = "#8A2BE2", col.shrunk = "#00BFFF")
box(bty="l")
axis(2)
axis(1)
```

<br>

<h2 align="right">Differential Expression Analysis (p-value)</h2>

As a means to retrieve the upcoming results as a comparison between normal and
tumor tissues, a contrast matrix is build using the `limma`
[@ritchieLimmaPowersDifferential2015] function `makeContrasts`.

```{r contrast_mat}
contrast_mat <- makeContrasts(Normal - Tumor, levels = design_mat)
```

<br>

All the subsequent results will now be relative to normal tissues. That is, a
positive log~2~-fold-change (logFC) will designate an over-expressed miRNA 
species in normal tissues. Conversely, a negative logFC will denote a more
highly expressed miRNA species in tumor tissues.

To account for the uncertainty in dispersion estimation in the error rate
control, rather than the usual likelihood ratio test, the quasi-likelihood
F-tests is performed with the `edgeR` function `glmQLFTest`. This approach leads
to *p-values* greater than or equal to the ones that the likelihood ratio tests
would had provided using the same negative binomial dispersions.

Finally, the obtained *p-values* are submitted to a multiple testing correction
as to regulate the false discovery rate (FDR) using the Benjamin-Hochberg
method. The `edgeR` function `topTags` provides a table with all the
differentially expressed miRNA species that have an adjusted *p-value* smaller
than or equal to 0.01. The decision of setting the significance level to 0.01
instead of the usual 0.05, is made to account for the small size of the data set
and the small batch effect previously seen. This stringent significance level
aims to reduce the likelihood of false positives that the mentioned facts could
originate. Note that the table with the results is not shown as it is
automatically added in the next plot.

```{r qlfTest}
res <- glmQLFTest(fit, contrast = contrast_mat)
top <- topTags(res, n = Inf, p.value = 0.01)
```

<br>

Using the `Glimma` function `glimmaVolcano` the results of the previous analysis
are displayed in an interactive volcano-plot.

```{r volcano, echo=F, fig.align = 'center'}
glimmaVolcano(res, dge = fit, 
              status = decideTestsDGE(res, p.value = 0.01),
              main = "Normal vs Tumor")
```

<br>

<h2 align="right">Differential Expression Analysis (fold-change)</h2>

The previous differential expression analysis, was based on statistical significance. 2131 differentially expressed miRNA species between the normal and
tumor tissues have been found among the 3291 present in the data set. Having
this huge amount of differentially expressed miRNA species, incites narrowing
down of the results to keep only the ones that are more biologically meaningful.

This new approach focuses on spotting whether the differential expression logFCs
are significantly broad than log~2~(1.5) by using the `edgeR` function
`glmTreat`. This method, prioritizes genes that are likely to have a higher
biological significance with larger changes. It is worth mentioning that
`glmTreat` does not set a fold change cutoff but evaluates both, the magnitude
of change of expressions values and its variability. Thus, not only the logFC
estimated value must be above the threshold, but its whole confidence interval
must.

```{r glmTreat}
res2 <- glmTreat(fit, contrast = contrast_mat, lfc = log2(1.5))
top2 <- topTags(res2, n = Inf, p.value = 0.01)
```

<br>

The final amount of differentially expressed miRNA species is 1440. Once more,
using the `Glimma` function `glimmaVolcano` the results of the analysis are
visualized in an interactive volcano-plot.

```{r volcano2, echo=F, fig.align = 'center'}
glimmaVolcano(res2, dge=fit, 
              status = decideTestsDGE(res2, p.value = 0.01), 
              main = "Normal vs Tumor")
```

<br>

# Bibliography

<hr class="half-width">

<br>
