miRNA and isomiR Differential Expression Analysis in Breast Cancer

Iris Mestres Pascual



In this study, the differential expression analysis of the results provided by the MIRFLOWZ pipeline is performed. The main goal is to asses if there are any miRNA species differentially expressed across normal and tumor tissues. The conclusions are found in the main manuscript. Here, only the workflow with step-by-step explanations is reported.


Data Preprocessing



Data Loading and Sample Selection

The MIRFLOWZ pipeline has been ran with a specific set of samples. Therefore, the metadata for the differential expression analysis must contain exactly the same ones. This information is provided in two different files, one containing data per case and the other one with per sample information.

The per case file, contains data from NCI’s Cancer Genome Atlas Program (TCGA). Thus, after loading the data frame, only the cases belonging to the breast cancer (BRCA) study are kept.

raw_case_metadata <- read_excel("case_metadata.xlsx",
                           sheet = 1,
                           col_types = NULL)

case_metadata <- raw_case_metadata %>% 
  select(2:34) %>% 
  filter(type == "BRCA")


The per sample metadata is already from the TCGA-BRCA study so, from the data set, a particular set of columns are selected; the file name (File.Name), the case and sample ID (Case.ID and Sample.ID respectively), and the sample’s tissue type (Sample.Type). In addition, the file name will be formatted as to match the sample names in the counts table and an extra field with the sample’s plate is added.

raw_sample_metadata <- read.table("./sample_metadata.csv",
                                  sep = "\t",
                                  header = TRUE)

sample_metadata <- raw_sample_metadata %>% 
  select(c(File.Name, Case.ID, Sample.ID, Sample.Type))

for (i in 1:nrow(sample_metadata)){
  sample_metadata$File.Name[i] <- substr(raw_sample_metadata$File.Name[i], 6, 28)
  sample_metadata$Plate[i] <- strsplit(sample_metadata$File.Name[i], "-")[[1]][5]
}


Following the sample selection criteria used in the main report, female cases containing solely a samples coming from a normal tissue and another one from a primary tumor with no missing data are kept. In addition, any case with a new tumor event or has died without tumors it is removed. Moreover, only infiltrating ductal carcinoma bellow Stage IIB cases are maintained. Finally, the sample’s tissue types are modified to be labeled as either Normal or Tumor.

sample_metadata <- sample_metadata %>% 
  group_by(Case.ID) %>% 
  filter( n() == 2 ,
          any(Sample.Type == "Solid Tissue Normal"), 
          !Sample.Type == "Metastatic") %>%
  arrange(Case.ID)

sample_metadata$Sample.Type[sample_metadata$Sample.Type == "Primary Tumor"] <- "Tumor"
sample_metadata$Sample.Type[sample_metadata$Sample.Type == "Solid Tissue Normal"] <- "Normal"

case_metadata <- case_metadata %>%
  filter(bcr_patient_barcode %in% sample_metadata$Case.ID,
         !(vital_status == "Dead"& tumor_status == "TUMOR FREE"),
         gender == "FEMALE",
         !is.na(tumor_status),
         is.na(new_tumor_event_type),
         race != "[Not Available]",
         histological_type == "Infiltrating Ductal Carcinoma",
         !ajcc_pathologic_tumor_stage %in% c("Stage IIIB", 
                                             "Stage IIIC", 
                                             "Stage IV")) %>%
  select(bcr_patient_barcode, 
         age_at_initial_pathologic_diagnosis,
         race, 
         ajcc_pathologic_tumor_stage,
         vital_status,
         tumor_status, 
         menopause_status, 
         initial_pathologic_dx_year, 
         birth_days_to, 
         death_days_to) 
colnames(case_metadata)[1] <- "Case.ID"

sample_metadata <- sample_metadata %>%
  filter(Case.ID %in% case_metadata$Case.ID) %>%
  select(File.Name, Case.ID, Sample.Type, Plate) %>%
  arrange(File.Name)


The filtered data sets are combined into a single one. The merge leads to have samples coming from normal tissues to have a cancer stage assigned thus, they are labeled as having no stage.

metadata <- sample_metadata %>% 
  inner_join(case_metadata, by = "Case.ID")

metadata <- column_to_rownames(metadata, var = "File.Name")

metadata$ajcc_pathologic_tumor_stage[metadata$Sample.Type == "Normal"] <- "None"


The sample’s count tables are joined together. There are miRNA species present in one or several samples that are not in others. With the merging, these counts are changed from NA to 0. Furthermore, the sample file names are used as column names and, given that the file names in the metadata and the counts table contain different separator characters, the column names in the counts table are changed to match the metadata. Finally, all the loaded information is stored in an edgeR (Robinson, McCarthy, and Smyth 2010).

options(scipen=999)

fnames <- list.files("./mirna_samples", full.names = T)
mirna_counts <-  read.csv(fnames[1],sep = "\t", header = TRUE)

for (table in fnames[-1]){
  table <- read.csv(table, header = TRUE, sep = "\t")
  mirna_counts <- mirna_counts %>% full_join(table)
}

mirna_counts[is.na(mirna_counts)] <- 0

rownames(mirna_counts) <- mirna_counts[,1]
mirna_counts <- mirna_counts[,-1]

colnames(mirna_counts) <-gsub("\\.", "-", colnames(mirna_counts))

dge <- DGEList(counts = mirna_counts, group = as.factor(metadata$Sample.Type))
options(digits = 2)


Exploratory Data Analysis

In this exploratory data analysis the nature of the data is analyzed from different perspectives as to identify trends, relationships and possible anomalies such as outliers or batch effects.

Data Distribution

The first visualization focuses on the miRNA species counts per sample as a way to get a glance of the count’s distribution. In the following plot, each horizontal line represents a miRNA species. For an easier interpretation, samples are colored.


The most notable aspect of the plot is the presence of low counts between the different samples. Due to the MIRFLOWZ unambiguous notation, there are very specific isomiRs present in low frequency in just a few samples. Hence, this trend was expected. On the other hand, a second set of miRNA species has a greater presence among the samples with up to 1000000 counts and only four samples stand out, being the higher count from the sample E2-A1LH-01A-11R-A14C-13 in the hsa-miR-21-5p canonical miRNA.

From a statistical point of view, genes with consistently low counts are very unlikely be assessed as significantly differentially expressed because low counts do not provide enough statistical evidence for a reliable judgement to be made. Such genes can therefore be removed from the analysis without any loss of information.

In order to asses which miRNAs will be kept, the edgeR function filterByExpr is used. This function bases the filtering on count-per-million (CPM) values so as to avoid favoring genes that are expressed in larger libraries over those expressed in smaller libraries following a similar approach as the one used by Chen, Lun, and Smyth (2016).

mirna2keep <- filterByExpr(dge)
dge <- dge[mirna2keep, , keep.lib.sizes=F]


Furthermore, to eliminate composition biases between libraries, the trimmed mean of M-values (TMM) variant proposed by Robinson and Oshlack (2010), “TMM with singleton pairing” ( TMMwsp ), is applied to the data set with the edgeR function calcNormFactors. This variant is preferred over the normal TMM as it is intended to perform better for data with a high proportion of zeros. As a result, an extra field with the library scaling factors is added to the DGEList object.

dge <- calcNormFactors(dge, method = "TMMwsp")


Principal Component Analysis

In order to accomplish a greater understanding of the data, a Principal Component Analysis (PCA) is performed using the R function prcomp. The PCA must be conducted over the normalized counts which are obtained with the edgeR function cpm. In addition, the normalized counts are returned as their log2 value.

The main goal of this analysis is to assess whether the different miRNA species are differentially expressed across tissues, each sample is colored according to it. Mind that as the final plot will be interactive, all the possible relevant information about each sample will be displayed when hovering over it.

log_norm_mirna <- cpm(dge, log = T, normalized.lib.sizes = )
PCA <- prcomp(t(log_norm_mirna), scale. = T)


In this second visualization, three outliers are found: samples BH-A0BZ-01A-31R-A12O-13, BH-A1FN-01A-11R-A13P-13 and E2-A1LH-01A-11R-A14C-13. Even if the former is the only one that has been already spotted as such, all three are removed from the data set. Prior to the PCA, the filtering of low expressed miRNAs and the counts normalization are performed again in order to capture the lacking of these miRNA species.

outliers <-  c("BH-A0BZ-01A-31R-A12O-13",
               "BH-A1FN-01A-11R-A13P-13",
               "E2-A1LH-01A-11R-A14C-13")

dge <- dge[, !rownames(dge$samples) %in% outliers]
metadata <- metadata[!rownames(metadata) %in% outliers, ]

mirna2keep <- filterByExpr(dge)
dge <- dge[mirna2keep, , keep.lib.sizes = F]

dge <- calcNormFactors(dge, method = "TMMwsp")
log_norm_mirna <- cpm(dge, log = T, normalized.lib.sizes = T)

PCA <- prcomp(t(log_norm_mirna), scale. = T)


There is a clear tendency in the 4 different clusters. Some parameters such as the sample’s vital status had been used in order to assess if there is any relation between them and the clustering. At the end, the sample’s origin plate and whether is present or not in both kind of tissues is the cause of it.


Plates A090 and A057 exhibit this effect the most followed by A085 (with a single sample) and A010. From these four plates, plates A057, A085 and A010 are the ones that were only used for tumor samples, while plate A090 was only used for normal tissue samples. It is important to mention that the small sample set used in this study, could be the reason behind this effect to be so large.


Differential Expression Analysis


It is of interest to see which miRNA species are expressed at different levels across tissues. With this goal in mind, a linear model is going to be fitted, to perform the differential expression analysis.


Design Matrix and Dispersion Estimation

The first step towards the objective is to build a design matrix. The design matrix links each tissue (columns) to the samples that belong to it (rows). Even if a slightly batch effect has been observed due to the plate of origin, the design in this study is considered to a one-way design under the assumption that the difference observed is not due to the plate of origin to be a potential confounder but an effect of the sample size.

design_mat <- model.matrix(~0 + dge$samples$group)
colnames(design_mat) <- levels(dge$samples$group)
  Normal Tumor
1      0     1
2      1     0
3      0     1
4      1     0
5      0     1
6      1     0


Due the nature of the data to be analyzed, there could be biological variability that may turn the observed counts into noise. Estimating this variability within the different tissues is going to help on modeling the statistical uncertainty associated with the observed counts and on making accurate inferences about differential expression between them.

The dispersion estimates, are computed with the edgeR function estimateDisp. This function uses the negative binomial distribution to model the read counts per gene per sample. With this function three different dispersion estimates are computed. The first one is an empirical Bayes moderated dispersion for each individual miRNA species (tagwise). The second one, the common dispersion, is a global estimate averaged across all species. Finally, a predicted dispersion for each miRNA is obtained from its abundance (trended).

In addition, to guard the tagwise estimates from outlier miRNAs with remarkably large or small individual dispersions, the robust = TRUE option is used (Phipson et al. 2016). The dispersion estimates are visualized with the edgeR function plotBCV. The biological coefficient of variation in the y-axis, is the square root of the negative binomial dispersion. In the x-axis the is the mean of the log normalized counts.

dge <- estimateDisp(dge, design_mat, robust=TRUE)


A characteristic of RNA-seq studies, is the tendency for higher negative binomial dispersions in genes with very low counts whereas the dispersion trend decreases asymptoticaly to a constant value for genes with higher counts (Chen, Lun, and Smyth 2016). Once more, due to the isomiRs naming strategy, the data contains a lot of low counts even after removing the less significant miRNA species.


In order to further account for technical variability. the negative binomial model can be extended with quasi-likelihood methods (Lun, Chen, and Smyth 2016) (Lund et al. 2012) . In this approach, the negative binomial dispersion trend describes the biological variability across all miRNAs. This overall trend is used to select the miRNA-specific variability that lay above and below it. Individual dispersions (tagwise) are not used.

Using the edgeR function glmQLFit several empirical Bayes statistics such as the quasi-likelihood dispersion trend, the squeezed quasi-likelihood dispersion estimates or the prior degrees of freedom, as well as the estimated values of the generalized linear model coefficients for each gene are computed. Incorporating robust handling of zero counts, especially when the fitted values of these counts yield useless residual degrees of freedom for the quasi-likelihood dispersion estimation (Phipson et al. 2016), is a focal point of this function. To obtain the squeezed quasi-likelihood dispersion estimates, the function tightens the raw quasi-likelihood dispersion estimates to adhere to a global trend. This serves to mitigate the uncertainty inherent in the estimates, thereby enhancing the statistical power of subsequent tests. Furthermore, the compression magnitude during the estimation is contingent upon the value of the prior degrees of freedom. Substantial prior degrees of freedom estimates indicate quasi-likelihood dispersions with lower variability across miRNA species. In such scenario, it is advisable to engage in robust empirical Bayes moderation. Conversely, smaller prior degrees of freedom estimates imply a higher degree of variability. In these cases, a milder moderation approach is more appropriate. This modulation, ensures that the empirical Bayes procedure adapts to the underlying characteristics of the data, yielding more accurate and meaningful results.

Setting robust = TRUE. as to facilitate the generation of miRNA-specific prior degrees of freedom estimates This approach minimizes the likelihood of encountering false positives from miRNA species with exceedingly high or low raw dispersions. Therefore, it ensures that the analysis remains resilient against data aberrations while amplifying its ability to identify significant differential expression across the majority of miRNAs.

The new dispersion estimates are visualized with the edgeR function plotQLDisp.

fit <- glmQLFit(dge, design_mat, robust=TRUE)


Differential Expression Analysis (p-value)

As a means to retrieve the upcoming results as a comparison between normal and tumor tissues, a contrast matrix is build using the limma (Ritchie et al. 2015) function makeContrasts.

contrast_mat <- makeContrasts(Normal - Tumor, levels = design_mat)


All the subsequent results will now be relative to normal tissues. That is, a positive log2-fold-change (logFC) will designate an over-expressed miRNA species in normal tissues. Conversely, a negative logFC will denote a more highly expressed miRNA species in tumor tissues.

To account for the uncertainty in dispersion estimation in the error rate control, rather than the usual likelihood ratio test, the quasi-likelihood F-tests is performed with the edgeR function glmQLFTest. This approach leads to p-values greater than or equal to the ones that the likelihood ratio tests would had provided using the same negative binomial dispersions.

Finally, the obtained p-values are submitted to a multiple testing correction as to regulate the false discovery rate (FDR) using the Benjamin-Hochberg method. The edgeR function topTags provides a table with all the differentially expressed miRNA species that have an adjusted p-value smaller than or equal to 0.01. The decision of setting the significance level to 0.01 instead of the usual 0.05, is made to account for the small size of the data set and the small batch effect previously seen. This stringent significance level aims to reduce the likelihood of false positives that the mentioned facts could originate. Note that the table with the results is not shown as it is automatically added in the next plot.

res <- glmQLFTest(fit, contrast = contrast_mat)
top <- topTags(res, n = Inf, p.value = 0.01)


Using the Glimma function glimmaVolcano the results of the previous analysis are displayed in an interactive volcano-plot.


Differential Expression Analysis (fold-change)

The previous differential expression analysis, was based on statistical significance. 2131 differentially expressed miRNA species between the normal and tumor tissues have been found among the 3291 present in the data set. Having this huge amount of differentially expressed miRNA species, incites narrowing down of the results to keep only the ones that are more biologically meaningful.

This new approach focuses on spotting whether the differential expression logFCs are significantly broad than log2(1.5) by using the edgeR function glmTreat. This method, prioritizes genes that are likely to have a higher biological significance with larger changes. It is worth mentioning that glmTreat does not set a fold change cutoff but evaluates both, the magnitude of change of expressions values and its variability. Thus, not only the logFC estimated value must be above the threshold, but its whole confidence interval must.

res2 <- glmTreat(fit, contrast = contrast_mat, lfc = log2(1.5))
top2 <- topTags(res2, n = Inf, p.value = 0.01)


The final amount of differentially expressed miRNA species is 1440. Once more, using the Glimma function glimmaVolcano the results of the analysis are visualized in an interactive volcano-plot.


Bibliography



Chen, Yunshun, Aaron T. L. Lun, and Gordon K. Smyth. 2016. “From Reads to Genes to Pathways: Differential Expression Analysis of RNA-Seq Experiments Using Rsubread and the edgeR Quasi-Likelihood Pipeline,” no. 5:1438 (August). https://doi.org/10.12688/f1000research.8987.2.
Lun, Aaron T. L., Yunshun Chen, and Gordon K. Smyth. 2016. “It’s DE-licious: A Recipe for Differential Expression Analyses of RNA-seq Experiments Using Quasi-Likelihood Methods in edgeR.” In Statistical Genomics: Methods and Protocols, edited by Ewy Mathé and Sean Davis, 391–416. Methods in Molecular Biology. New York, NY: Springer. https://doi.org/10.1007/978-1-4939-3578-9_19.
Lund, Steven P., Dan Nettleton, Davis J. McCarthy, and Gordon K. Smyth. 2012. “Detecting Differential Expression in RNA-sequence Data Using Quasi-Likelihood with Shrunken Dispersion Estimates.” Statistical Applications in Genetics and Molecular Biology 11 (5): /j/sagmb.2012.11.issue-5/1544-6115.1826/1544-6115.1826.xml. https://doi.org/10.1515/1544-6115.1826.
Phipson, Belinda, Stanley Lee, Ian J. Majewski, Warren S. Alexander, and Gordon K. Smyth. 2016. “Robust Hyperparameter Estimation Protects Against Hypervariable Genes and Improves Power to Detect Differential Expression.” The Annals of Applied Statistics 10 (2): 946–63. https://doi.org/10.1214/16-AOAS920.
Ritchie, Matthew E., Belinda Phipson, Di Wu, Yifang Hu, Charity W. Law, Wei Shi, and Gordon K. Smyth. 2015. “Limma Powers Differential Expression Analyses for RNA-sequencing and Microarray Studies.” Nucleic Acids Research 43 (7): e47. https://doi.org/10.1093/nar/gkv007.
Robinson, Mark D., Davis J. McCarthy, and Gordon K. Smyth. 2010. edgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26 (1): 139–40. https://doi.org/10.1093/bioinformatics/btp616.
Robinson, Mark D., and Alicia Oshlack. 2010. “A Scaling Normalization Method for Differential Expression Analysis of RNA-seq Data.” Genome Biology 11 (3): R25. https://doi.org/10.1186/gb-2010-11-3-r25.
---
bibliography: ../fdp.bib
link-citations: true
output:
  html_document:
    css: bootstrap.css
    highlight: haddock
    code_download: TRUE
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, comment = NA, warning = FALSE, message = FALSE)

library(dplyr)
library(edgeR)
library(factoextra)
library(ggplot2)
library(Glimma)
library(kableExtra)
library(knitr)
library(pheatmap)
library(plotly)
library(RColorBrewer)
library(readxl)
library(reshape2)
library(statmod)
library(tidyverse)

# Changing default table's style
kable <- function(data, ...){
        knitr::kable(data, 
                     format = "html",
                     digits = 3,
                     format.args = list(scientific = TRUE),
                     align = 'c',
                     linesep = '') %>%
        kable_paper() %>%
        scroll_box(width = "600px", height = "400px")
}

knit_print.data.frame <- function(x, ...){
        res <- paste(c('', '', kable(x)),
                     collapse = "\n")
        asis_output(res)
}

registerS3method("knit_print", 
                 "data.frame",
                 knit_print.data.frame)

# Print nothing if NA within a table
options(knit.kable.table.NA = '')

theme_set(theme_classic())
```

<br>

<h1>miRNA and isomiR Differential Expression Analysis in Breast Cancer</h1>

<h3 align="right">Iris Mestres Pascual</h3>

<hr>

<br>

In this study, the differential expression analysis of the results provided by
the *MIRFLOWZ* pipeline is performed. The main goal is to asses if there are any
miRNA species differentially expressed across normal and tumor tissues. The
conclusions are found in the main manuscript. Here, only the workflow with
step-by-step explanations is reported.

<br>

<h1>Data Preprocessing</h1>
<hr class="half-width">

<br>

<h2 align="right">Data Loading and Sample Selection</h2>

The *MIRFLOWZ* pipeline has been ran with a specific set of samples. Therefore, 
the metadata for the differential expression analysis must contain exactly the 
same ones. This information is provided in two different files, one containing 
data *per case* and the other one with *per sample* information.

The *per case* file, contains data from NCI's Cancer Genome Atlas Program 
(TCGA). Thus, after loading the data frame, only the cases belonging to the 
breast cancer (BRCA) study are kept.

```{r case_metadata, warning=FALSE}
raw_case_metadata <- read_excel("case_metadata.xlsx",
                           sheet = 1,
                           col_types = NULL)

case_metadata <- raw_case_metadata %>% 
  select(2:34) %>% 
  filter(type == "BRCA")
```

<br>

The *per sample* metadata is already from the TCGA-BRCA study so, from the data
set, a particular set of columns are selected; the file name (`File.Name`), the
case and sample ID (`Case.ID` and `Sample.ID` respectively), and the sample's
tissue type (`Sample.Type`). In addition, the file name will be formatted as to
match the sample names in the counts table and an extra field with the sample's
plate is added.

```{r sample_metadata}
raw_sample_metadata <- read.table("./sample_metadata.csv",
                                  sep = "\t",
                                  header = TRUE)

sample_metadata <- raw_sample_metadata %>% 
  select(c(File.Name, Case.ID, Sample.ID, Sample.Type))

for (i in 1:nrow(sample_metadata)){
  sample_metadata$File.Name[i] <- substr(raw_sample_metadata$File.Name[i], 6, 28)
  sample_metadata$Plate[i] <- strsplit(sample_metadata$File.Name[i], "-")[[1]][5]
}
```

<br>

Following the sample selection criteria used in the main report, female cases
containing solely a samples coming from a normal tissue and another one from a
primary tumor with no missing data are kept. In addition, any case with a new
tumor event or has died without tumors it is removed. Moreover, only
infiltrating ductal carcinoma bellow *Stage IIB* cases are maintained. Finally,
the sample's tissue types are modified to be labeled as either `Normal` or
`Tumor`.

```{r sample_selection}
sample_metadata <- sample_metadata %>% 
  group_by(Case.ID) %>% 
  filter( n() == 2 ,
          any(Sample.Type == "Solid Tissue Normal"), 
          !Sample.Type == "Metastatic") %>%
  arrange(Case.ID)

sample_metadata$Sample.Type[sample_metadata$Sample.Type == "Primary Tumor"] <- "Tumor"
sample_metadata$Sample.Type[sample_metadata$Sample.Type == "Solid Tissue Normal"] <- "Normal"

case_metadata <- case_metadata %>%
  filter(bcr_patient_barcode %in% sample_metadata$Case.ID,
         !(vital_status == "Dead"& tumor_status == "TUMOR FREE"),
         gender == "FEMALE",
         !is.na(tumor_status),
         is.na(new_tumor_event_type),
         race != "[Not Available]",
         histological_type == "Infiltrating Ductal Carcinoma",
         !ajcc_pathologic_tumor_stage %in% c("Stage IIIB", 
                                             "Stage IIIC", 
                                             "Stage IV")) %>%
  select(bcr_patient_barcode, 
         age_at_initial_pathologic_diagnosis,
         race, 
         ajcc_pathologic_tumor_stage,
         vital_status,
         tumor_status, 
         menopause_status, 
         initial_pathologic_dx_year, 
         birth_days_to, 
         death_days_to) 
colnames(case_metadata)[1] <- "Case.ID"

sample_metadata <- sample_metadata %>%
  filter(Case.ID %in% case_metadata$Case.ID) %>%
  select(File.Name, Case.ID, Sample.Type, Plate) %>%
  arrange(File.Name)
```

<br>

The filtered data sets are combined into a single one. The merge leads to have
samples coming from normal tissues to have a cancer stage assigned thus, they
are labeled as having no stage.

```{r merge_metadata}
metadata <- sample_metadata %>% 
  inner_join(case_metadata, by = "Case.ID")

metadata <- column_to_rownames(metadata, var = "File.Name")

metadata$ajcc_pathologic_tumor_stage[metadata$Sample.Type == "Normal"] <- "None"
```

<br>

The sample's count tables are joined together. There are miRNA species present
in one or several samples that are not in others. With the merging, these counts
are changed from `NA` to 0. Furthermore, the sample file names are used as
column names and, given that the file names in the metadata and the counts table
contain different separator characters, the column names in the counts table are
changed to match the metadata. Finally, all the loaded information is stored in
an `edgeR` [@robinsonEdgeRBioconductorPackage2010].

```{r edgeR_mir-tables}
options(scipen=999)

fnames <- list.files("./mirna_samples", full.names = T)
mirna_counts <-  read.csv(fnames[1],sep = "\t", header = TRUE)

for (table in fnames[-1]){
  table <- read.csv(table, header = TRUE, sep = "\t")
  mirna_counts <- mirna_counts %>% full_join(table)
}

mirna_counts[is.na(mirna_counts)] <- 0

rownames(mirna_counts) <- mirna_counts[,1]
mirna_counts <- mirna_counts[,-1]

colnames(mirna_counts) <-gsub("\\.", "-", colnames(mirna_counts))

dge <- DGEList(counts = mirna_counts, group = as.factor(metadata$Sample.Type))
options(digits = 2)
```

<br>

<h2 align="right">Exploratory Data Analysis</h2>

In this exploratory data analysis the nature of the data is analyzed from
different perspectives as to identify trends, relationships and possible
anomalies such as outliers or batch effects.

<h3 align="right">Data Distribution</h3>

The first visualization focuses on the miRNA species counts per sample as a
way to get a glance of the count's distribution. In the following plot, each
horizontal line represents a miRNA species. For an easier interpretation,
samples are colored.

```{r countpersample, echo=F, fig.align = 'center'}
mirna_melt <- melt(as.matrix(dge$counts))

t <- ggplot(mirna_melt, aes(x = Var2, 
                            y = value, 
                            group = Var1, 
                            color = Var2)) +
  geom_line(stat = "identity") +
  labs( title = "miRNA Species Counts per Sample", 
        x = "Sample", y = "Counts") +
  theme(legend.position = "none", 
        axis.text.x = element_blank())
t
```

<br>

The most notable aspect of the plot is the presence of low counts between the
different samples. Due to the *MIRFLOWZ* unambiguous notation, there are very
specific isomiRs present in low frequency in just a few samples. Hence, this
trend was expected. On the other hand, a second set of miRNA species has a
greater presence among the samples with up to 1000000 counts and only four
samples stand out, being the higher count from the sample
`E2-A1LH-01A-11R-A14C-13` in the `hsa-miR-21-5p` canonical miRNA.

From a statistical point of view, genes with consistently low counts are very
unlikely be assessed as significantly differentially expressed because low
counts do not provide enough statistical evidence for a reliable judgement to be
made. Such genes can therefore be removed from the analysis without any loss of
information.

In order to asses which miRNAs will be kept, the `edgeR` function `filterByExpr`
is used. This function bases the filtering on count-per-million (CPM) values so
as to avoid favoring genes that are expressed in larger libraries over those
expressed in smaller libraries following a similar approach as the one used by
@chenReadsGenesPathways2016.

```{r mirna2keep}
mirna2keep <- filterByExpr(dge)
dge <- dge[mirna2keep, , keep.lib.sizes=F]
```

<br>

Furthermore, to eliminate composition biases between libraries, the trimmed mean
of M-values (TMM) variant proposed by @robinsonScalingNormalizationMethod2010a,
"TMM with singleton pairing" ( *TMMwsp* ), is applied to the data set with the
`edgeR` function `calcNormFactors`. This variant is preferred over the normal
TMM as it is intended to perform better for data with a high proportion of
zeros. As a result, an extra field with the library scaling factors is added to
the `DGEList` object.

```{r normalization}
dge <- calcNormFactors(dge, method = "TMMwsp")
```

<br>

<h3 align="right">Principal Component Analysis</h3>

In order to accomplish a greater understanding of the data, a Principal
Component Analysis (PCA) is performed using the `R` function `prcomp`. The PCA
must be conducted over the normalized counts which are obtained with the `edgeR`
function `cpm`. In addition, the normalized counts are returned as their log~2~
value.

The main goal of this analysis is to assess whether the different miRNA species
are differentially expressed across tissues, each sample is colored according to
it. Mind that as the final plot will be interactive, all the possible relevant
information about each sample will be displayed when hovering over it.

```{r tissue PCA}
log_norm_mirna <- cpm(dge, log = T, normalized.lib.sizes = )
PCA <- prcomp(t(log_norm_mirna), scale. = T)
```

```{r plot_tissue-PCA, echo=F, fig.align = 'center'}
components <- data.frame(PCA$x)

Case <- metadata$Case.ID
Sample <- rownames(components)
Stage <- metadata$ajcc_pathologic_tumor_stage
Tissue <- metadata$Sample.Type
"Vital status" <- metadata$vital_status

gg <- ggplot(components,aes(x = PC1, y= PC2, 
                            case = Case,
                            sample = Sample,
                            shape = Tissue, 
                            color = Tissue, 
                            stage = Stage, 
                            vital = `Vital status`)) +
    geom_point(alpha = 0.8, size = 4) +
  geom_hline(yintercept=0, linetype="dashed", alpha = 0.2) +
  geom_vline(xintercept=0, linetype="dashed", alpha = 0.2) +
  labs(title = "miRNA Expression Across Samples PCA", 
       x = "PC1 (23.01%)", 
       y = "PC2 (9.35%)") +
  scale_color_manual(values=c("#40E0D0","#8A2BE2")) +
  scale_shape_manual(values = c(1, 5))

ggplotly(gg)
```

<br>

In this second visualization, three outliers are found: samples
`BH-A0BZ-01A-31R-A12O-13`, `BH-A1FN-01A-11R-A13P-13` and
`E2-A1LH-01A-11R-A14C-13`. Even if the former is the only one that has been
already spotted as such, all three are removed from the data set. Prior to the
PCA, the filtering of low expressed miRNAs and the counts normalization are
performed again in order to capture the lacking of these miRNA species.

```{r removeOutliers}
outliers <-  c("BH-A0BZ-01A-31R-A12O-13",
               "BH-A1FN-01A-11R-A13P-13",
               "E2-A1LH-01A-11R-A14C-13")

dge <- dge[, !rownames(dge$samples) %in% outliers]
metadata <- metadata[!rownames(metadata) %in% outliers, ]

mirna2keep <- filterByExpr(dge)
dge <- dge[mirna2keep, , keep.lib.sizes = F]

dge <- calcNormFactors(dge, method = "TMMwsp")
log_norm_mirna <- cpm(dge, log = T, normalized.lib.sizes = T)

PCA <- prcomp(t(log_norm_mirna), scale. = T)
```

```{r plot_removeOutliers, echo=F, fig.align = 'center'}
components <- data.frame(PCA$x)

Case <- metadata$Case.ID
Sample <- rownames(components)
Stage <- metadata$ajcc_pathologic_tumor_stage
Tissue <- metadata$Sample.Type
"Vital status" <- metadata$vital_status

gg <- ggplot(components,aes(x = PC1, y= PC2, 
                            case = Case,
                            sample = Sample,
                            shape = Tissue, 
                            color = Tissue, 
                            stage = Stage, 
                            vital = `Vital status`)) +
    geom_point(alpha = 0.8, size = 4) +
  geom_hline(yintercept=0, linetype="dashed", alpha = 0.2) +
  geom_vline(xintercept=0, linetype="dashed", alpha = 0.2) +
  labs(title = "miRNA Expression Across Samples PCA\nwithout outliers", 
       x = "PC1 (27.37%)", 
       y = "PC2 (14.02%)") +
  scale_color_manual(values=c("#40E0D0","#8A2BE2")) +
  scale_shape_manual(values = c(1, 5))

ggplotly(gg)
```

<br>

There is a clear tendency in the 4 different clusters. Some parameters such as
the sample's vital status had been used in order to assess if there is any
relation between them and the clustering. At the end, the sample's origin plate
and whether is present or not in both kind of tissues is the cause of it.

```{r plot_plate-PCA, echo=F, fig.align = 'center'}
Plate <- metadata$Plate

plates_col = c( "#4B0082", "#00BFFF", "#FF0000", "#EB22BE", "#7CFC00",
                "#FF1493", "#8A2BE2", "#228B22", "#C71585", "#FF8C00",
                "#9900CC", "#A52A2A", "#00FFFF", "#FFD700", "#00FA9A")

gg <- ggplot(components,aes(x = PC1, y= PC2, 
                            case = Case,
                            sample = Sample,
                            stage = Stage, 
                            color = Plate,
                            shape = Tissue,
                            vital = `Vital status`)) +
    geom_point(alpha = 0.8, size = 4) +
  geom_hline(yintercept=0, linetype="dashed", alpha = 0.2) +
  geom_vline(xintercept=0, linetype="dashed", alpha = 0.2) +
  labs(title = "miRNA Expression Across Samples per Plate PCA", 
       x = "PC1 (27.37%)", 
       y = "PC2 (14.02%)") +
  scale_color_manual(values = plates_col) +
  scale_shape_manual(values = c(1, 5)) 

ggplotly(gg)
```

<br>

Plates `A090` and `A057` exhibit this effect the most followed by `A085` (with a
single sample) and `A010`. From these four plates, plates `A057`, `A085` and 
`A010` are the ones that were only used for tumor samples, while plate `A090`
was only used for normal tissue samples. It is important to mention that the
small sample set used in this study, could be the reason behind this effect to
be so large.

<br>

<h1>Differential Expression Analysis</h1>
<hr class="half-width2">

It is of interest to see which miRNA species are expressed at different levels
across tissues. With this goal in mind, a linear model is going to be fitted, to
perform the differential expression analysis.

<br>

<h2 align="right">Design Matrix and Dispersion Estimation</h2>

The first step towards the objective is to build a design matrix. The design
matrix links each tissue (columns) to the samples that belong to it (rows). Even
if a slightly batch effect has been observed due to the plate of origin, the
design in this study is considered to a one-way design under the assumption that
the difference observed is not due to the plate of origin to be a potential
confounder but an effect of the sample size.

```{r design mat.v2}
design_mat <- model.matrix(~0 + dge$samples$group)
colnames(design_mat) <- levels(dge$samples$group)
```

```{r view_design-mat, echo=F}
head(design_mat)
```

<br>

Due the nature of the data to be analyzed, there could be biological variability
that may turn the observed counts into noise. Estimating this variability within
the different tissues is going to help on modeling the statistical uncertainty
associated with the observed counts and on making accurate inferences about
differential expression between them.

The dispersion estimates, are computed with the `edgeR` function `estimateDisp`.
This function uses the negative binomial distribution to model the read counts
per gene per sample. With this function three different dispersion estimates are
computed. The first one is an empirical Bayes moderated dispersion for each
individual miRNA species (*tagwise*). The second one, the common dispersion, is
a global estimate averaged across all species. Finally, a predicted dispersion
for each miRNA is obtained from its abundance (*trended*).

In addition, to guard the *tagwise* estimates from outlier miRNAs with
remarkably large or small individual dispersions, the `robust = TRUE` option is
used [@phipsonRobustHyperparameterEstimation2016]. The dispersion estimates are
visualized with the `edgeR` function `plotBCV`. The biological coefficient of
variation in the `y-axis`, is the square root of the negative binomial
dispersion. In the `x-axis` the is the mean of the log normalized counts.

```{r disp.estimat}
dge <- estimateDisp(dge, design_mat, robust=TRUE)
```

```{r plot_dis.estimat, echo=F, fig.align = 'center'}
plotBCV(dge, main = "miRNA Counts Dispersion Estimates", 
        frame.plot=FALSE , 
        col.trend = "#8A2BE2", col.common = "#00BFFF")
box(bty="l")
axis(2)
axis(1)
```

<br>

A characteristic of RNA-seq studies, is the tendency for higher negative
binomial dispersions in genes with very low counts whereas the dispersion trend
decreases asymptoticaly to a constant value for genes with higher counts
[@chenReadsGenesPathways2016]. Once more, due to the isomiRs naming strategy,
the data contains a lot of low counts even after removing the less significant
miRNA species.

```{r plot_counts-dist, echo=F, fig.align = 'center'}
mirna_melt <- melt(log_norm_mirna)

ggplot(mirna_melt, aes(x = value)) +
  geom_histogram(binwidth = 0.5, 
                 color = "#FFFFFF", 
                 fill = "#9900CC",
                 alpha = 0.7) +
  labs( title = "Normalized log Counts Distribution", 
        x = "Counts", y = "miRNA species") +
  theme(legend.position = "none")
```

<br>

In order to further account for technical variability. the negative binomial
model can be extended with quasi-likelihood methods [@lunItDEliciousRecipe2016a]
[@lundDetectingDifferentialExpression2012] . In this approach, the negative
binomial dispersion trend describes the biological variability across all
miRNAs. This overall trend is used to select the miRNA-specific variability that
lay above and below it. Individual dispersions (*tagwise*) are not used.

Using the `edgeR` function `glmQLFit` several empirical Bayes statistics such as
the quasi-likelihood dispersion trend, the squeezed quasi-likelihood dispersion
estimates or the prior degrees of freedom, as well as the estimated values of
the generalized linear model coefficients for each gene are computed.
Incorporating robust handling of zero counts, especially when the fitted values
of these counts yield useless residual degrees of freedom for the
quasi-likelihood dispersion estimation 
[@phipsonRobustHyperparameterEstimation2016], is a focal point of this function.
To obtain the squeezed quasi-likelihood dispersion estimates, the function
tightens the raw quasi-likelihood dispersion estimates to adhere to a global
trend. This serves to mitigate the uncertainty inherent in the estimates,
thereby enhancing the statistical power of subsequent tests. Furthermore, the
compression magnitude during the estimation is contingent upon the value of the
prior degrees of freedom. Substantial prior degrees of freedom estimates
indicate quasi-likelihood dispersions with lower variability across miRNA
species. In such scenario, it is advisable to engage in robust empirical Bayes
moderation. Conversely, smaller prior degrees of freedom estimates imply a 
higher degree of variability. In these cases, a milder moderation approach is 
more appropriate. This modulation, ensures that the empirical Bayes procedure
adapts to the underlying characteristics of the data, yielding more accurate and
meaningful results.

Setting `robust = TRUE`. as to facilitate the generation of miRNA-specific prior
degrees of freedom estimates This approach minimizes the likelihood of 
encountering false positives from miRNA species with exceedingly high or low raw
dispersions. Therefore, it ensures that the analysis remains resilient against 
data aberrations while amplifying its ability to identify significant 
differential expression across the majority of miRNAs.

The new dispersion estimates are visualized with the `edgeR` function
`plotQLDisp`.

```{r quasi-likeliFit}
fit <- glmQLFit(dge, design_mat, robust=TRUE)
```

```{r plot_quasi-likeliFit, echo=F, fig.align = 'center'}
plotQLDisp(fit,  main = "miRNA Counts Quasi-Likelihood Dispersion Estimates", 
           frame.plot=FALSE , 
           col.trend = "#8A2BE2", col.shrunk = "#00BFFF")
box(bty="l")
axis(2)
axis(1)
```

<br>

<h2 align="right">Differential Expression Analysis (p-value)</h2>

As a means to retrieve the upcoming results as a comparison between normal and
tumor tissues, a contrast matrix is build using the `limma`
[@ritchieLimmaPowersDifferential2015] function `makeContrasts`.

```{r contrast_mat}
contrast_mat <- makeContrasts(Normal - Tumor, levels = design_mat)
```

<br>

All the subsequent results will now be relative to normal tissues. That is, a
positive log~2~-fold-change (logFC) will designate an over-expressed miRNA 
species in normal tissues. Conversely, a negative logFC will denote a more
highly expressed miRNA species in tumor tissues.

To account for the uncertainty in dispersion estimation in the error rate
control, rather than the usual likelihood ratio test, the quasi-likelihood
F-tests is performed with the `edgeR` function `glmQLFTest`. This approach leads
to *p-values* greater than or equal to the ones that the likelihood ratio tests
would had provided using the same negative binomial dispersions.

Finally, the obtained *p-values* are submitted to a multiple testing correction
as to regulate the false discovery rate (FDR) using the Benjamin-Hochberg
method. The `edgeR` function `topTags` provides a table with all the
differentially expressed miRNA species that have an adjusted *p-value* smaller
than or equal to 0.01. The decision of setting the significance level to 0.01
instead of the usual 0.05, is made to account for the small size of the data set
and the small batch effect previously seen. This stringent significance level
aims to reduce the likelihood of false positives that the mentioned facts could
originate. Note that the table with the results is not shown as it is
automatically added in the next plot.

```{r qlfTest}
res <- glmQLFTest(fit, contrast = contrast_mat)
top <- topTags(res, n = Inf, p.value = 0.01)
```

<br>

Using the `Glimma` function `glimmaVolcano` the results of the previous analysis
are displayed in an interactive volcano-plot.

```{r volcano, echo=F, fig.align = 'center'}
glimmaVolcano(res, dge = fit, 
              status = decideTestsDGE(res, p.value = 0.01),
              main = "Normal vs Tumor")
```

<br>

<h2 align="right">Differential Expression Analysis (fold-change)</h2>

The previous differential expression analysis, was based on statistical significance. 2131 differentially expressed miRNA species between the normal and
tumor tissues have been found among the 3291 present in the data set. Having
this huge amount of differentially expressed miRNA species, incites narrowing
down of the results to keep only the ones that are more biologically meaningful.

This new approach focuses on spotting whether the differential expression logFCs
are significantly broad than log~2~(1.5) by using the `edgeR` function
`glmTreat`. This method, prioritizes genes that are likely to have a higher
biological significance with larger changes. It is worth mentioning that
`glmTreat` does not set a fold change cutoff but evaluates both, the magnitude
of change of expressions values and its variability. Thus, not only the logFC
estimated value must be above the threshold, but its whole confidence interval
must.

```{r glmTreat}
res2 <- glmTreat(fit, contrast = contrast_mat, lfc = log2(1.5))
top2 <- topTags(res2, n = Inf, p.value = 0.01)
```

<br>

The final amount of differentially expressed miRNA species is 1440. Once more,
using the `Glimma` function `glimmaVolcano` the results of the analysis are
visualized in an interactive volcano-plot.

```{r volcano2, echo=F, fig.align = 'center'}
glimmaVolcano(res2, dge=fit, 
              status = decideTestsDGE(res2, p.value = 0.01), 
              main = "Normal vs Tumor")
```

<br>

# Bibliography

<hr class="half-width">

<br>
