Introduction
1. set up parameters
Section I: identify proteins that bind miRNA sequences
Section II: predict miR-93 and miR-484 targets
10.Explore miR-93 putative targets
Section III: Identify additional miRNAs NCL may regulate
1. Download shear regulated miR expression from GEO (GSE26953)
2. Annotate predicted NCL regulated miRs with shear Fold changes
Section IV: Analyze serum levels of miR-93-3p, miR-93-5p, and miR-484
References

Introduction

This documentation provides a detailed description of the functions and workthrough used for the bioinformatic prediction of microRNA (miR) binding proteins that may regulate the maturation of miRs in the endothelium in response to Pulsatile (PS) or Oscillatory (OS) shear stress. This analysis accompanies the publication titled “Shear Stress Regulation of miR-93 and miR-484 Maturation through Nucleolin” and reproduces graphs and tables contained in the publication.

This analysis is divided into four sections: Section I involves identifying RNA binding proteins that potentially regulate the maturation of microRNAs (miR) in the endothelium as well as which ones of them are regulated by AMP activated protein kinase (AMPK), a well established mediator or energy homeostasis and mechano-signaling pathway that prevents endothelial dysfunction (1-4). This analysis identifies NCL as an important regulator of miR-93 and miR-484 biogenesis. Section II involves identifying endothelial transcripts that are targeted by miRs regulated by miR binding proteins identified in section I. In this analysis, we foculed on nucleolin (NCL) and its regulation of miR-93 and miR-484. Therefore, in this section, we identified targets of both miR-93 and miR-484, establishing them as miRs that are governed by shear stress and confer endothelial dycfunction by targeting KLF2 and eNOS, two markers of endothelial homeostasis. Section III involves mining online databases for the identification of miRs NCL putatively regulates. Finally, Section IV details the steps and statistical methods used in the validation of miR-93 and miR-484 in the serum of people with coronray artery disease (CAD) compared to healthy controls (CTRL).

Set Up parameters

In this section, all libraries used in the analysis are loaded and the working directory is set (excluded from view). Additionally, the organism and BioMartdataset objects are assigned which are used later in the analysis to query BioMart for annotation information. Functions for this analysis can be found at the following github repository: https://github.com/brengong/ConservationtextmineR

devtools::install_github('brengong/ConservationtextmineR')

#### load packages ####
suppressMessages(library(data.table))
suppressMessages(library(tidyverse))
suppressMessages(library(readxl))
suppressMessages(library(ggpubr))
suppressMessages(library(dplyr))
suppressMessages(library(dtplyr))
suppressMessages(library(purrr))
suppressMessages(library(stringr))
suppressMessages(library(gridExtra))
suppressMessages(library(caret))
suppressMessages(library(plotly))
suppressMessages(library(biomaRt))
suppressMessages(library(systemPipeR))
suppressMessages(library(GenomicFeatures))
suppressMessages(library(BiocParallel))
suppressMessages(library(ConservationtextmineR))

#### set up parameters ####
Organ <- "Homo sapiens" 
BioMartdataset <-"hsapiens_gene_ensembl"

Section I: identify proteins that bind miRNA sequences

Load and compile RNA binding protein data tables

In this section, we used data obtained from the A daTabase of RNA binding proteins and AssoCiated moTifs (ATtRACT) data base (https://attract.cnic.es/) to obtain a comprehensive list of RNA binding proteins and their cognate binding motifs. The link to download the data is located at: https://attract.cnic.es/attract/static/download/download_db/dd088b2b-8dd3-4e31-8e73-e1b866983a1e.txt. This information is ultimately used to determine which RNA binding proteins can bind to miR sequences. In addition to this database. NCL is an additional RNA binding protein identified based on literature searches (5-6) and was included in the ultimate table. Both the direct and indirect targeting information for NCL was included.

DT <- fread("./data/1 RRM protein database.csv") 
DTsub <- DT[,c("Gene ID", "Gene Name", "Organism", "Motif")]
DTsub <- DTsub[DTsub$Organism == "Homo_sapiens",]
DTsub <- DTsub[!duplicated(DTsub$`Gene Name`),]
DTsub <- unique(DTsub,by = c('Gene Name','Motif'))
DTsub$`Gene Name` <- gsub("[*]", "", DTsub$`Gene Name`)
setnames(DTsub, c("Gene ID", "Gene Name", "Organism", "Motif"),
         c("Gene_ID", "Targeting_Factor", "Organism", "Consensus_Sequence"))
Gene_ID <- c("ENSG00000115053", "ENSG00000115053" )
Targeting_Factor <- c("NCL", "NCL"); Organism <- c("Homo_sapiens", "Homo_sapiens")
Consensus_Sequence <- c("UCCCGA", "GU")
NCL <- as.data.table(cbind(Gene_ID, Targeting_Factor, Organism, Consensus_Sequence))
DTsub <- rbind(DTsub, NCL); rm(NCL); rm(Consensus_Sequence); rm(Gene_ID); rm(Organism); rm(Targeting_Factor)
DTsub

##               Gene_ID Targeting_Factor     Organism Consensus_Sequence
##   1:  ENSG00000136450            SRSF1 Homo_sapiens            CAGACAA
##   2: ENSG00000161547*            SRSF2 Homo_sapiens           GAAUCCCG
##   3:  ENSG00000104824           HNRNPL Homo_sapiens          CACACCACA
##   4:  ENSG00000136231          IGF2BP3 Homo_sapiens            AAACUCA
##   5:  ENSG00000143889          HNRNPLL Homo_sapiens            ACAUACA
##  ---                                                                  
## 176:  ENSG00000160710             ADAR Homo_sapiens               GCGC
## 177:  ENSG00000104413            ESRP1 Homo_sapiens             AGGGAU
## 178:  ENSG00000108771            DHX58 Homo_sapiens              GCGCG
## 179:  ENSG00000115053              NCL Homo_sapiens             UCCCGA
## 180:  ENSG00000115053              NCL Homo_sapiens                 GU

Load and compile miRNA data table

To obtain miR sequences to query RNA binding protein sequence binding information against, both mature and immature miR sequences were downloaded from miRbase. The link to download both the mature and immature miR sequence information is located at: http://www.mirbase.org/ftp.shtml. Once the fasta files are downloaded and saved in the working directory, the seqCompile function reads these fasta files and converts them into a data table containing the sequence information, the miR name, and if the miRNA sequence corresponds to the immature or mature sequence. In this section of code, the results of the first 6 mature miRNA sequences are printed.

RefSeqHPmiRNA <- seqCompile(files= "./data/2018-12-18 hairpin.fa.gz", 
                           direct = getwd(), type = "miRNA", miRNA_type = "IMMATURE_HAIR_PIN" )
RefSeqIDMatMiRNA <- seqCompile(files= "./data/2018-12-18 mature.fa.gz", 
                               direct = getwd(), type = "miRNA", miRNA_type = "MATURE" )
MiRNADT <- rbind(RefSeqHPmiRNA, RefSeqIDMatMiRNA)
tail(MiRNADT)

##                  Sequence         miRNA_Name miRNA_type
## 1: GAGGAUGCUGAUCAUUCACUGG   smc-miR-12461-5p     MATURE
## 2: AGUAAAUGAUCAGCAUCCUCCA   smc-miR-12461-3p     MATURE
## 3: UUCUGCUCCUAUUUAAGUCAAU   gga-miR-1784b-5p     MATURE
## 4: UGAUUUCAAUAAGAGCAGAAUU   gga-miR-1784b-3p     MATURE
## 5:  AGCAUAGAAUGUCAGAUCUAG   mdo-miR-7385g-3p     MATURE
## 6:  AGCAUAGAAUGUCAGGUCUAG mdo-miR-7385g-2-3p     MATURE

Next, a system of tracking the same miRs in different species is needed. To accomplish this, MiRNASpeciesAnnot reads the three letter species designation located in the first three letters of the miR name and annotates the data table with the common name and scientific name of the species the miR sequence belongs to. Next, in order to track the same miR in different species, MiRNAname strips the three letter species designation from the miR name and adds the species deidentified miR name to a new column labeled “miRNA”.

MiRNADT <- MiRNASpeciesAnnot(MiRNADT)
MiRNADT <- MiRNAname(MiRNADT)
head(MiRNADT)[,2:ncol(MiRNADT)]

##    miRNA_Name        miRNA_type Common_Name        Scientific_Name   miRNA
## 1:  CEL-LET-7 IMMATURE_HAIR_PIN   roundworm Caenorhabditis_elegans   LET-7
## 2:  CEL-LIN-4 IMMATURE_HAIR_PIN   roundworm Caenorhabditis_elegans LIN-LIN
## 3:  CEL-MIR-1 IMMATURE_HAIR_PIN   roundworm Caenorhabditis_elegans   MIR-1
## 4:  CEL-MIR-2 IMMATURE_HAIR_PIN   roundworm Caenorhabditis_elegans   MIR-2
## 5: CEL-MIR-34 IMMATURE_HAIR_PIN   roundworm Caenorhabditis_elegans  MIR-34
## 6: CEL-MIR-35 IMMATURE_HAIR_PIN   roundworm Caenorhabditis_elegans  MIR-35

Identify RNABP miRNA matches

In this section, out goal is to use the miR data table (designated miRNADT in the code above) and the RNA binding protein data tables downloaded, compiled, and cleaned in the previous sections, to map the binding sequences of all RNA binding proteins onto miR sequences for every species. In this analysis, we are interested in predicting which RNA binding proteins can facilitate the maturation of miRs. Threfore, we used only mature miR sequences in this analysis. The TFpredict function matches sequences from one data table to binding sequences from a second data table. It then compiles a new data table containing the binding sequence (consensus sequence) used, the start and stop position on the querey sequence (miR sequence) for which the binding site was identified, the gene ID for the RNA binding protein, the RNA binding protein queried (Targeting Factor), the species the RNA binding prrotein is from (Organism), the length of the queried sequence (length), the sequence queried with the binding site presented in lower case (Sequence), the identified sequence with the flanking residue (Identified Sequence), the name of the miR the sequence belongs to (gene_symbol), the common name (Species) and scientific name (Scientific Name), of the miR sequence.

#### Perform the transcription factor query computations on immature miRNA hairpins
MiRNADTim <- MiRNADT[MiRNADT$miRNA_type == "IMMATURE_HAIR_PIN",]
MiRNADTim$Sequence <- as.character(MiRNADTim$Sequence)
setnames(MiRNADTim, c("Sequence", "miRNA_Name", "miRNA_type", "Common_Name", "Scientific_Name", "miRNA"),
         c("Sequence", "miRNA_Name", "miRNA_type", "Common_Name", "Scientific_Name", "gene_symbol"))
TX_TOT_Species <- TFpredict(Target = MiRNADTim, Targeting_Factor_DT = DTsub, type = "multiple_species")

Output of the resulting data table (excluding the sequences, gene IDs, and the species for the RBP)

TX_TOT_Species[,c(1:4,6,8, 10:11)]

##         Consensus_Sequence start end Number_Hits Targeting_Factor length Identified_sequence gene_symbol
##      1:              AGAAC    78  82           3           DAZAP1    182             UagaacA     MIR-319
##      2:              AGAAC    78  82           3             YBX1    182             UagaacA     MIR-319
##      3:              AGAAC    78  82           3        HNRNPA2B1    182             UagaacA     MIR-319
##      4:              AGAAC    78  82           3           DAZAP1    182             UagaacA     MIR-319
##      5:              AGAAC    78  82           3             YBX1    182             UagaacA     MIR-319
##     ---                                                                                                 
## 647466:              GUGGU   133 137           2           ZNF346    247             AgugguG   MIR-11970
## 647467:               UCAC   175 178           1            NOVA1    247              AucacG   MIR-11970
## 647468:              UUGUU    98 102           1            CELF2    247             UuuguuA   MIR-11970
## 647469:             UUUAUA   112 117           1             PPIE    247            UuuuauaA   MIR-11970
## 647470:              UUUUA   111 115           1            CPEB1    247             AuuuuaU   MIR-11970

Next, because we the biological sciences use rodent models to validate bioinformatic predictions, we were interested in retaining only predicted associations conserved at a minimum between Human, mouse, and rat. To accomplish this, SpeciesTFCons uses a character vector containing species that are required for the predicted interaction to be present in. It then retains only miR RNA binding protein interactions conserved at a minimum between the species required (in this case, human, mouse, and rat). If the interaction is conserved between the minimum designated species, all other species the predicted interaction is observed in are retained. The number of records prior to and after running this filter are illustrated below.

DT25 <- SpeciesTFCons(DT = TX_TOT_Species,Spec = c("Human", "Mouse", "Rat"), provide = "TF_Target")
nrow(TX_TOT_Species)
nrow(DT25)

## [1] 647470

## [1] 72274

Identify predicted AMPK targets

In this section, RNA binding proteins that are predicted AMPK targets were identified. All other RNA binding proteins were discarded. To accomplish this, AMPK targets previously identified were used. R codes we previously wrote that generate a predicted AMPK target database was used for this portion of the analysis (7-8). The identification of AMPK targets was conducted in two steps: in the first step, the AMPK target database was loaded. The number of records prior to and after running this filter are illustrated below.

keys <- fread("./data/99 AMPK target keys.xls")
keys

##        Targeting_Factor
##     1:            HNF4G
##     2:            TRIM7
##     3:           TRIM41
##     4:          SLC2A10
##     5:          SLC13A3
##    ---                 
## 13415:          PCDHGB5
## 13416:           PCDHB9
## 13417:           H2AFB3
## 13418:           RIMBP3
## 13419:         PRAMEF25

In the second step, the merge function was used to retain only RNA binding proteins that are putative AMPK targets. The number of targets prior to and after this filter are executes are illustrated below.

AMPK_targs <- merge(DT25, keys, by = "Targeting_Factor")
AMPK_targs[,c(1:5, 8, 11:12)]

##        Targeting_Factor Consensus_Sequence start end Number_Hits length gene_symbol                Species
##     1:             ACO1             CAGUGG    23  28           1    105     MIR-140                 Lizard
##     2:             ACO1             CAGUGG    52  57           1     85   MIR-194-1                 Lizard
##     3:             ACO1             CAGUGG    54  59           1     87   MIR-194-2                 Lizard
##     4:             ACO1             CAGUGG    64  69           1    101    MIR-216A                 Lizard
##     5:             ACO1             CAGUGG    52  57           1     73     MIR-27A                 Lizard
##    ---                                                                                                    
## 67929:            ZFP36             UCUUCU    68  73           1     85     MIR-877                    Rat
## 67930:            ZFP36             UCUUCU    11  16           1    110     MIR-10A   White-lipped tamarin
## 67931:            ZFP36             UCUUCU    92  97           1    150     MIR-10A        Tasmanian devil
## 67932:            ZFP36             UCUUCU    88  93           1     93     MIR-10A Schmidtea mediterranea
## 67933:            ZFP36             UCUUCU    25  30           1     75     MIR-494              Wild boar

nrow(DT25)

## [1] 72274

nrow(AMPK_targs)

## [1] 67933

Identify miR binding proteins that are expressed in endothelial cells

In this step, the expression levels of identified miR binding proteins that are also putative AMPK targets in endothelial cells were analysed. Any proteins that were not expressed in the endothelium were discarded. To accomplish this we used a data source containing RNA expression profiles of endothelial cells subjected to pulsatile or oscillatory shear stress (9).

PSOS <- fread("./data/5 PSOS FC data.xls")
PSOS <- PSOS[!duplicated(PSOS$GeneName),][,c(2,33,34, 53, 54, 43, 44,63,64)]
setnames(PSOS, colnames(PSOS), 
         c("Gene.symbol", "setA_OS_20_hr", "setA_OS_24_hr", "setB_OS_20_hr", "setB_OS_24_hr", 
           "setA_PS_20_hr", "setA_PS_24_hr", "setB_PS_20_hr", "setB_PS_24_hr"))
AMPK_targs_PSOS <- AMPK_targs[(AMPK_targs$Targeting_Factor %in% PSOS$Gene.symbol)]

A final list of 66 RNA binding proteins was identified for further analysis.

length(AMPK_targs_PSOS[!duplicated(AMPK_targs_PSOS$Targeting_Factor),]$Targeting_Factor)

## [1] 66

Quantification of identified proteins

At this point of the analysis, a data table of RNA binding proteins and their putative miR binding matches across species evolution has been generated. These putative miR binding proteins are expressed in endothelial cells, predicted to be phosphorylated by AMPK, and the predicted interaction and the predicted AMPK phosphorylation of the RNA binding protein are both conserved at a minimum between human, mouse, and rat. In this section, the number of miRs each RNA binding protein is predicted to regulate and the species conservation of the binding site will be explored. To do this, a function called TFRankR was created. This function takes the data table obtained in the previous section and ranks proteins based on the number of proteins they target, the number of species containing each target interaction, the number of hits between the RNA binding protein and miR, or any combination of these ranking classifications. For this analysis, we determined the number of immature miRNAs each RNA binding protein targets in human.

TX_PROM <- TFRankR(DT = AMPK_targs_PSOS, sortBy = "Target", dec = TRUE)
TX_PROM <- TX_PROM[TX_PROM$Species == "Human",]
head(TX_PROM)

##    Targeting_Factor Number_targets Species
## 1:              NCL            237   Human
## 2:             NXF1             78   Human
## 3:            NOVA1             65   Human
## 4:           NUDT21             65   Human
## 5:           PIWIL1             60   Human
## 6:               F2             55   Human

Then a Plot of the number of miRs each RNA binding protein putatively regulates was generated.

ggplot(TX_PROM, aes(reorder(`Targeting_Factor`, -Number_targets), Number_targets)) + 
  geom_point(shape = 21, size = 3, colour = "black", fill = "gray") + 
  theme_pubr() +
  labs(x = "Gene Name", y = "MiRNA abundance") +
  scale_color_manual(values=c("#E31A1C")) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  geom_point(data = subset(TX_PROM, Targeting_Factor == "NCL"),
             aes(x=reorder(`Targeting_Factor`, -Number_targets), y=Number_targets), 
             shape = 21, size = 3, colour = "black", fill = "#08519C")

Because NCL was predicted to regulate the most number of miRs, we analyzed the species conservation of the NCL binding site across all miRs. NCL can both directly and indirectly regulate miRs (5-6). Whule “GU” sequences are a measure of indirect binding, “UCCCGA” sequences are a measure of direct binding (5-6). Because we were interested in the miRs NCL may directly regulate, we identified the species conservation of the “UCCCGA” binding sequence. This process was conducted in two steps. The first step involved ranking the miRs by species abundance of the “UCCCGA” binding site.

miR <- fread("./results/1 Raw immature miRNA hits.xls")
NCL_miR <- miR[miR$Targeting_Factor == "NCL" & miR$Consensus_Sequence == "UCCCGA",]
TX_SPEC <- TFRankR(DT = NCL_miR, sortBy = "species", dec = TRUE)
TX_SPEC

##      Targeting_Factor Number_Species gene_symbol
##   1:              NCL             19      MIR-93
##   2:              NCL             12     MIR-484
##   3:              NCL              9     MIR-206
##   4:              NCL              7     MIR-887
##   5:              NCL              5     MIR-598
##  ---                                            
## 260:              NCL              1      MIR-H7
## 261:              NCL              1   MIR-M59-2
## 262:              NCL              1      MIR-M7
## 263:              NCL              1   MIR-RO6-1
## 264:              NCL              1   MIR-RR1-4

In the second step, the ranking results were graphed.

maxct <- max(TX_SPEC$Number_Species)
howmany <- 30
ggplot(TX_SPEC[1:howmany], 
             aes(y=Number_Species, x=reorder(gene_symbol,Number_Species), 
                 ymin=0, ymax=Number_Species), col="blue") +
  geom_point(size=2.5, color="#E7B800") + 
  geom_linerange(size=1, color="#00AFBB") +
  theme(plot.title = element_text(size = 10,colour="black")) +  
  coord_flip() +
  theme_pubr()+#
  theme(axis.text=element_text(size=rel(1)),axis.title=element_text(size=rel(1.25))) +
  geom_text(aes(label = paste(Number_Species,sep=":::"), y = Number_Species + .05*maxct), size = 4) +
  ggtitle("NCL binding site conservation\n") +
  ylab("Number of Species") +
  xlab("pre-miR\n")

Section II: Identification of miR-93 and miR-484 targets

In this section, the goal is to identify targets of miR-93 and miR-484 in endothelial cells responding to shear stress. To accomplish this, first the seed sequence of miR-93 and miR-484 were used to determine which mRNAs contain a reverse complement miR-targeting SEED sequence (10). This was conducted for human, mouse, and rat species. Only targeting sequences conserved between human, mouse, and rat were retained for further analysis. Next, we downloaded and analyzed RNAseq datasets from endothelial cells overexpressing miR-93 or miR-484. We then compared these overexpression datasets to OS/PS datasets to identify transcripts downregulated by either miR-93 or miR-484 overexpression and OS/PS. Finally, of the co-downregulated transcripts, we identified which ones contain a reverse complement SEED sequence target.

Download and compile mRNA sequences

The first step in the miRNA prediction process is to obtain a data table containing all mRNA sequences for all species available on the ensembl database (http://uswest.ensembl.org/info/data/ftp/index.html). Once the files have been downloaded and saved into a separate file, the seqCompile function reads all sequences and compiles them into a data table containing the mRNA sequence, the ensempl transcript ID, the name of the species file, as well as the scientific and common name or the species the sequence is found in.

transcriptome <- seqCompile(1:length(dir()), type = "mRNA")
transcriptome[,2:5]

##         ensembl_transcript_id                              Species_File   Scientific_Name Common_Name
##      1:       ENST00000632684        Homo_sapiens.GRCh38.cdna.all.fa.gz      Homo_sapiens       Human
##      2:       ENST00000434970        Homo_sapiens.GRCh38.cdna.all.fa.gz      Homo_sapiens       Human
##      3:       ENST00000448914        Homo_sapiens.GRCh38.cdna.all.fa.gz      Homo_sapiens       Human
##      4:       ENST00000415118        Homo_sapiens.GRCh38.cdna.all.fa.gz      Homo_sapiens       Human
##      5:       ENST00000631435        Homo_sapiens.GRCh38.cdna.all.fa.gz      Homo_sapiens       Human
##     ---                                                                                              
## 334607:    ENSRNOT00000090245 Rattus_norvegicus.Rnor_6.0.cdna.all.fa.gz Rattus_norvegicus         Rat
## 334608:    ENSRNOT00000093304 Rattus_norvegicus.Rnor_6.0.cdna.all.fa.gz Rattus_norvegicus         Rat
## 334609:    ENSRNOT00000092502 Rattus_norvegicus.Rnor_6.0.cdna.all.fa.gz Rattus_norvegicus         Rat
## 334610:    ENSRNOT00000089207 Rattus_norvegicus.Rnor_6.0.cdna.all.fa.gz Rattus_norvegicus         Rat
## 334611:    ENSRNOT00000086520 Rattus_norvegicus.Rnor_6.0.cdna.all.fa.gz Rattus_norvegicus         Rat

Next, the annotation information can be downloaded. this is accomplished for Human, mouse, and rat species by running A biomaRt query in a loop.

COMPANNOT <- NULL
Values <- transcriptome$ensembl_transcript_id
data.set <- c("hsapiens_gene_ensembl", "mmusculus_gene_ensembl", "rnorvegicus_gene_ensembl")
for(i in 1:length(data.set)){
  mymart <- useMart("ensembl", dataset=data.set[i])
  mRNAdata_annot <- getBM(attributes=c("ensembl_transcript_id", "external_gene_name", 
                         "transcription_start_site", "transcript_start", "transcript_end", 
                         "chromosome_name"), values = Values, mart = mymart)
  COMPANNOT <- rbind(COMPANNOT, mRNAdata_annot)
}
COMPANNOT

##         ensembl_transcript_id external_gene_name transcription_start_site chromosome_name
##      1:       ENST00000585070            MIR4723                 28360654              17
##      2:       ENST00000459194            RF00019                 61642383               9
##      3:       ENST00000620386            RF02116                 88992370              10
##      4:       ENST00000618314            MIR6085                 62343029              15
##      5:       ENST00000410938         RNU6-1328P                 94495299               7
##     ---                                                                                  
## 408354:    ENSRNOT00000076510              Eda2r                 66602458               X
## 408355:    ENSRNOT00000017429              Eda2r                 66602506               X
## 408356:    ENSRNOT00000067217            Ccdc88b                222118459               1
## 408357:    ENSRNOT00000092790            Ccdc88b                222107007               1
## 408358:    ENSRNOT00000092969            Ccdc88b                222118195               1

Finally, the transcriptome data table is annotated with the biomaRt information.

TRANS_CHR <- merge(transcriptome, COMPANNOT, by= "ensembl_transcript_id")
TRANS_CHR <- TRANS_CHR[!duplicated(TRANS_CHR$ensembl_transcript_id),]
TRANS_CHR <- TRANS_CHR[order(TRANS_CHR$Species),]
TRANS_CHR$external_gene_name <- toupper(TRANS_CHR$external_gene_name)
TRANS_CHR[,c(1,4:6, 10)]

##         ensembl_transcript_id   Scientific_Name Common_Name external_gene_name chromosome_name
##      1:       ENST00000000233      Homo_sapiens       Human               ARF5               7
##      2:       ENST00000000412      Homo_sapiens       Human               M6PR              12
##      3:       ENST00000000442      Homo_sapiens       Human              ESRRA              11
##      4:       ENST00000001008      Homo_sapiens       Human              FKBP4              12
##      5:       ENST00000001146      Homo_sapiens       Human            CYP26B1               2
##     ---                                                                                       
## 334607:    ENSRNOT00000093750 Rattus_norvegicus         Rat              FUBP3               3
## 334608:    ENSRNOT00000093751 Rattus_norvegicus         Rat             GRIN3B               7
## 334609:    ENSRNOT00000093752 Rattus_norvegicus         Rat                MPL               5
## 334610:    ENSRNOT00000093753 Rattus_norvegicus         Rat               WBP4              15
## 334611:    ENSRNOT00000093754 Rattus_norvegicus         Rat       LOC100912373               5

Format mRNA data table

Next, the mRNA data table was cleaned by eliminating all sequences belonging to extraneous chromosomes.

data.table(TRANS_CHR[!duplicated(TRANS_CHR$chromosome_name),]$chromosome_name)

##                  V1
##   1:              7
##   2:             12
##   3:             11
##   4:              2
##   5:              6
##  ---               
## 583:     KL568013.1
## 584:     KL568410.1
## 585: AABR07024206.1
## 586: AABR07022620.1
## 587:     KL568122.1

CHR_Labels <- c(1:21, "X", "Y")
TRANS_CHR <- TRANS_CHR[(TRANS_CHR$chromosome_name %in% CHR_Labels),]
TRANS_CHR[,c(1,5,6,10), with = FALSE]

##         ensembl_transcript_id Common_Name external_gene_name chromosome_name
##      1:       ENST00000000233       Human               ARF5               7
##      2:       ENST00000000412       Human               M6PR              12
##      3:       ENST00000000442       Human              ESRRA              11
##      4:       ENST00000001008       Human              FKBP4              12
##      5:       ENST00000001146       Human            CYP26B1               2
##     ---                                                                     
## 311121:    ENSRNOT00000093750         Rat              FUBP3               3
## 311122:    ENSRNOT00000093751         Rat             GRIN3B               7
## 311123:    ENSRNOT00000093752         Rat                MPL               5
## 311124:    ENSRNOT00000093753         Rat               WBP4              15
## 311125:    ENSRNOT00000093754         Rat       LOC100912373               5

data.table(TRANS_CHR[!duplicated(TRANS_CHR$chromosome_name),]$chromosome_name)

##     V1
##  1:  7
##  2: 12
##  3: 11
##  4:  2
##  5:  6
##  6: 16
##  7:  4
##  8:  3
##  9:  1
## 10:  8
## 11: 19
## 12: 17
## 13:  5
## 14: 14
## 15:  X
## 16: 10
## 17: 18
## 18: 20
## 19: 13
## 20: 15
## 21:  Y
## 22:  9
## 23: 21
##     V1

Finally, we eliminated all splice variants and retained only the variants with the longest sequence.

PROMTrans <- VariantSort(TRANS_CHR, variant = "MAX")
PROMTrans[,c(1, 5, 6, 10:11), with = FALSE]

##        ensembl_transcript_id Common_Name external_gene_name chromosome_name Length
##     1:       ENST00000463733       Human               ARF5               7   1509
##     2:       ENST00000000412       Human               M6PR              12   2756
##     3:       ENST00000405666       Human              ESRRA              11   2283
##     4:       ENST00000001008       Human              FKBP4              12   3732
##     5:       ENST00000001146       Human            CYP26B1               2   4732
##    ---                                                                            
## 93707:    ENSRNOT00000093717         Rat     AABR07059381.1               4    175
## 93708:    ENSRNOT00000093724         Rat     AABR07040760.1               X    414
## 93709:    ENSRNOT00000093732         Rat     AABR07003509.1               1   1974
## 93710:    ENSRNOT00000093742         Rat         AC125565.1               2   1005
## 93711:    ENSRNOT00000093749         Rat          LOC681392               X   1744

Format miRNA data table

Now that the mRNA data table is formatted, we need to format the miRNA data table to include the species the miRNA is from as well as the miRNA targeting information.

MiRNADT[,2:ncol(MiRNADT)]

##                miRNA_Name        miRNA_type               Common_Name        Scientific_Name          miRNA
##     1:          CEL-LET-7 IMMATURE_HAIR_PIN                 roundworm Caenorhabditis_elegans          LET-7
##     2:          CEL-LIN-4 IMMATURE_HAIR_PIN                 roundworm Caenorhabditis_elegans        LIN-LIN
##     3:          CEL-MIR-1 IMMATURE_HAIR_PIN                 roundworm Caenorhabditis_elegans          MIR-1
##     4:          CEL-MIR-2 IMMATURE_HAIR_PIN                 roundworm Caenorhabditis_elegans          MIR-2
##     5:         CEL-MIR-34 IMMATURE_HAIR_PIN                 roundworm Caenorhabditis_elegans         MIR-34
##    ---                                                                                                     
## 87470:   SMC-MIR-12461-3P            MATURE                        NA                     NA   MIR-12461-3P
## 87471:   GGA-MIR-1784B-5P            MATURE                   Chicken          Gallus_gallus   MIR-1784B-5P
## 87472:   GGA-MIR-1784B-3P            MATURE                   Chicken          Gallus_gallus   MIR-1784B-3P
## 87473:   MDO-MIR-7385G-3P            MATURE Gray short-tailed opossum  Monodelphis_domestica   MIR-7385G-3P
## 87474: MDO-MIR-7385G-2-3P            MATURE Gray short-tailed opossum  Monodelphis_domestica MIR-7385G-2-3P

First, we subsetted the data table to return miRNAs of interest.

MiDT <- MiRNADT[MiRNADT$miRNA == "MIR-93-3P" | MiRNADT$miRNA == "MIR-93-5P" | MiRNADT$miRNA == "MIR-484",]
MiDT <- MiDT[MiDT$Common_Name == "Human" | MiDT$Common_Name == "Mouse" | MiDT$Common_Name == "Rat",]
MiDT <- MiDT[MiDT$miRNA_type == "MATURE",]
MiDT[,2:ncol(MiDT)]

##       miRNA_Name miRNA_type Common_Name   Scientific_Name     miRNA
## 1: HSA-MIR-93-5P     MATURE       Human      Homo_sapiens MIR-93-5P
## 2: HSA-MIR-93-3P     MATURE       Human      Homo_sapiens MIR-93-3P
## 3: MMU-MIR-93-5P     MATURE       Mouse      Mus_musculus MIR-93-5P
## 4: MMU-MIR-93-3P     MATURE       Mouse      Mus_musculus MIR-93-3P
## 5: RNO-MIR-93-5P     MATURE         Rat Rattus_norvegicus MIR-93-5P
## 6: RNO-MIR-93-3P     MATURE         Rat Rattus_norvegicus MIR-93-3P
## 7:   HSA-MIR-484     MATURE       Human      Homo_sapiens   MIR-484
## 8:   MMU-MIR-484     MATURE       Mouse      Mus_musculus   MIR-484
## 9:   RNO-MIR-484     MATURE         Rat Rattus_norvegicus   MIR-484

Next, we determined the miRNA seed sequence (positions 2-8 from the 5’ end), that must be perfectly complimentary for miRNA-mRNA binding (10).

MiDT <- MISeed(MiDT); MiDT[,2:ncol(MiDT)]

##       miRNA_Name miRNA_type Common_Name   Scientific_Name     miRNA seed_Sequence
## 1: HSA-MIR-93-5P     MATURE       Human      Homo_sapiens MIR-93-5P       AAAGUGC
## 2: HSA-MIR-93-3P     MATURE       Human      Homo_sapiens MIR-93-3P       CUGCUGA
## 3: MMU-MIR-93-5P     MATURE       Mouse      Mus_musculus MIR-93-5P       AAAGUGC
## 4: MMU-MIR-93-3P     MATURE       Mouse      Mus_musculus MIR-93-3P       CUGCUGA
## 5: RNO-MIR-93-5P     MATURE         Rat Rattus_norvegicus MIR-93-5P       AAAGUGC
## 6: RNO-MIR-93-3P     MATURE         Rat Rattus_norvegicus MIR-93-3P       CUGCUGA
## 7:   HSA-MIR-484     MATURE       Human      Homo_sapiens   MIR-484       CAGGCUC
## 8:   MMU-MIR-484     MATURE       Mouse      Mus_musculus   MIR-484       CAGGCUC
## 9:   RNO-MIR-484     MATURE         Rat Rattus_norvegicus   MIR-484       CAGGCUC

Finally, we obtained the query sequence (the reverse complement of the SEED sequence).

MiDT <- MIQuerySeq(MiDT, wobble = TRUE); MiDT[,2:ncol(MiDT)]

##       miRNA_Name miRNA_type Common_Name   Scientific_Name     miRNA seed_Sequence          Query_Sequence
## 1: HSA-MIR-93-5P     MATURE       Human      Homo_sapiens MIR-93-5P       AAAGUGC     G(C|T)(A|G)(C|T)TTT
## 2: HSA-MIR-93-3P     MATURE       Human      Homo_sapiens MIR-93-3P       CUGCUGA T(C|T)(A|G)G(C|T)(A|G)G
## 3: MMU-MIR-93-5P     MATURE       Mouse      Mus_musculus MIR-93-5P       AAAGUGC     G(C|T)(A|G)(C|T)TTT
## 4: MMU-MIR-93-3P     MATURE       Mouse      Mus_musculus MIR-93-3P       CUGCUGA T(C|T)(A|G)G(C|T)(A|G)G
## 5: RNO-MIR-93-5P     MATURE         Rat Rattus_norvegicus MIR-93-5P       AAAGUGC     G(C|T)(A|G)(C|T)TTT
## 6: RNO-MIR-93-3P     MATURE         Rat Rattus_norvegicus MIR-93-3P       CUGCUGA T(C|T)(A|G)G(C|T)(A|G)G
## 7:   HSA-MIR-484     MATURE       Human      Homo_sapiens   MIR-484       CAGGCUC     G(A|G)G(C|T)(C|T)TG
## 8:   MMU-MIR-484     MATURE       Mouse      Mus_musculus   MIR-484       CAGGCUC     G(A|G)G(C|T)(C|T)TG
## 9:   RNO-MIR-484     MATURE         Rat Rattus_norvegicus   MIR-484       CAGGCUC     G(A|G)G(C|T)(C|T)TG

Predict miR-93 and miR-484 targets

In this section, we queried miRNA’s against mRNA sequences in human, mouse, and rat. The function MIRNATargetpredict queries the miR query sequence against all mRNA sequences. Both the miRNA query sequence and the mRNA sequences are paired for each species. It then returns a data table with paired matches.

setnames(PROMTrans, colnames(PROMTrans),
        c("ensembl_transcript_id", "Sequence", "Species_File", "Scientific_Name", "Species", 
          "external_gene_name", "transcription_start_site", "transcript_start", "transcript_end", 
          "chromosome_name", "Length"))
setnames(MiDT, c("Sequence", "miRNA_Name", "miRNA_type", "Common_Name", "Scientific_Name", 
                 "seed_Sequence", "Query_Sequence"),
               c("Sequence", "miRNA_Name", "miRNA_type", "Species", "Scientific_Name", 
                 "seed_Sequence", "Query_Sequence"))
Spec <- c("Homo_sapiens", "Mus_musculus", "Rattus_norvegicus")
MiRNAhitstot <- MIRNATargetpredict(MiRNADT = MiDT, mRNADT = PROMTrans, type = "Multiple", Spe = Spec)
MiRNAhitstot[, c(1, 3:7, 9:10), with = FALSE]

##                  Query_Sequence start  end      mRNA_Name Number_Hits length           Species     miRNA
##      1:     G(A|G)G(C|T)(C|T)TG  1289 1295           ARF5           1   1509      Homo_sapiens   MIR-484
##      2:     G(C|T)(A|G)(C|T)TTT  1307 1313           M6PR           3   2756      Homo_sapiens MIR-93-5P
##      3:     G(C|T)(A|G)(C|T)TTT  2203 2209           M6PR           3   2756      Homo_sapiens MIR-93-5P
##      4:     G(C|T)(A|G)(C|T)TTT  2685 2691           M6PR           3   2756      Homo_sapiens MIR-93-5P
##      5: T(C|T)(A|G)G(C|T)(A|G)G   569  575           M6PR           1   2756      Homo_sapiens MIR-93-3P
##     ---                                                                                                 
## 497491: T(C|T)(A|G)G(C|T)(A|G)G   916  922 AABR07003509.1           2   1974 Rattus_norvegicus MIR-93-3P
## 497492:     G(A|G)G(C|T)(C|T)TG   171  177 AABR07003509.1           1   1974 Rattus_norvegicus   MIR-484
## 497493:     G(C|T)(A|G)(C|T)TTT   960  966      LOC681392           2   1744 Rattus_norvegicus MIR-93-5P
## 497494:     G(C|T)(A|G)(C|T)TTT  1096 1102      LOC681392           2   1744 Rattus_norvegicus MIR-93-5P
## 497495:     G(A|G)G(C|T)(C|T)TG  1644 1650      LOC681392           1   1744 Rattus_norvegicus   MIR-484

Format identified target data table and impose minimum species conservation stringency

Next, the species designation were removed from the miRNA name and add this new name lacking the species information to another column in the new dataset. This new label was used to compute predicted miR-mRNA species preservation.

MiRNAhitstot <- MiRNAname(MiRNAhitstot); MiRNAhitstot[, c(1,3:7, 9:10), with = FALSE]

##                  Query_Sequence start  end      mRNA_Name Number_Hits length           Species     miRNA
##      1:     G(A|G)G(C|T)(C|T)TG  1289 1295           ARF5           1   1509      Homo_sapiens   MIR-484
##      2:     G(C|T)(A|G)(C|T)TTT  1307 1313           M6PR           3   2756      Homo_sapiens MIR-93-5P
##      3:     G(C|T)(A|G)(C|T)TTT  2203 2209           M6PR           3   2756      Homo_sapiens MIR-93-5P
##      4:     G(C|T)(A|G)(C|T)TTT  2685 2691           M6PR           3   2756      Homo_sapiens MIR-93-5P
##      5: T(C|T)(A|G)G(C|T)(A|G)G   569  575           M6PR           1   2756      Homo_sapiens MIR-93-3P
##     ---                                                                                                 
## 497491: T(C|T)(A|G)G(C|T)(A|G)G   916  922 AABR07003509.1           2   1974 Rattus_norvegicus MIR-93-3P
## 497492:     G(A|G)G(C|T)(C|T)TG   171  177 AABR07003509.1           1   1974 Rattus_norvegicus   MIR-484
## 497493:     G(C|T)(A|G)(C|T)TTT   960  966      LOC681392           2   1744 Rattus_norvegicus MIR-93-5P
## 497494:     G(C|T)(A|G)(C|T)TTT  1096 1102      LOC681392           2   1744 Rattus_norvegicus MIR-93-5P
## 497495:     G(A|G)G(C|T)(C|T)TG  1644 1650      LOC681392           1   1744 Rattus_norvegicus   MIR-484

Next, miRNA-mRNA associations conservation between human, mouse, and rat were retained for each mRNA.

setnames(MiRNAhitstot, colnames(MiRNAhitstot),
                       c("Query_Sequence", "start", "end", "miRNA_Name", "gene_symbol", "Number_Hits", 
                         "length", "Sequence", "Species", "Targeting_Factor"))
MiRNAhitstot <- SpeciesTFCons(DT = MiRNAhitstot,Spec = c("Homo_sapiens", "Mus_musculus", 
                                                         "Rattus_norvegicus"), provide = "TF_Target")
MiRNAhitstot[, c(1,3:7, 9:10), with = FALSE]

##              Query_Sequence start  end gene_symbol Number_Hits length           Species Targeting_Factor
##      1: G(A|G)G(C|T)(C|T)TG  1289 1295        ARF5           1   1509      Homo_sapiens          MIR-484
##      2: G(C|T)(A|G)(C|T)TTT  1307 1313        M6PR           3   2756      Homo_sapiens        MIR-93-5P
##      3: G(C|T)(A|G)(C|T)TTT  2203 2209        M6PR           3   2756      Homo_sapiens        MIR-93-5P
##      4: G(C|T)(A|G)(C|T)TTT  2685 2691        M6PR           3   2756      Homo_sapiens        MIR-93-5P
##      5: G(A|G)G(C|T)(C|T)TG   412  418       ESRRA           4   2283      Homo_sapiens          MIR-484
##     ---                                                                                                 
## 308526: G(A|G)G(C|T)(C|T)TG  1710 1716    IL1RAPL2           5   3264 Rattus_norvegicus          MIR-484
## 308527: G(A|G)G(C|T)(C|T)TG  1956 1962    IL1RAPL2           5   3264 Rattus_norvegicus          MIR-484
## 308528: G(A|G)G(C|T)(C|T)TG  1967 1973    IL1RAPL2           5   3264 Rattus_norvegicus          MIR-484
## 308529: G(C|T)(A|G)(C|T)TTT   902  908      POU5F2           1   1235 Rattus_norvegicus        MIR-93-5P
## 308530: G(A|G)G(C|T)(C|T)TG  1108 1114      POU5F2           1   1235 Rattus_norvegicus          MIR-484

The number of predicted interactions before and after minimum species filtering is as follows:

## [1] 497495      8

## [1] 308530      8

Download and analyze miR-93 overexpression RNAseq

In this section, RNAseq data generated from miR-93 overexpression in human umbilical vein endothelial cells (HUVEC) available online through gene expression omnibus was analyzed. The data is freely available at the following link: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE86497. Because this is an RNAseq dataset, we downloaded the FASTq files, aligned them to the human GRCh38 reference genome, and computed fold changes relative to control. This is a multistep process and is outlined below. The first step in this process is to download the human genome and corresponding gtf file:

URL <- "ftp://ftp.ensembl.org/pub/release-85/gtf/homo_sapiens/Homo_sapiens.GRCh38.85.gtf.gz"
download.file(URL, destfile = "./data/GTF.gtf.gz")

URLGEN <- "ftp://ftp.ensembl.org/pub/release-85/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"
download.file(URLGEN, destfile = "./data/genome.fa.gz")

After downloading the files, the file were opened and the numer of A, T, C, or G nucleotides were counted to get a sense of the genomic sequences. The frequencies of the nucleotides and the percentage of “N” not called nucleotides can be computed as follows:

myseq <- readDNAStringSet("./data/genome.fasta")
freq <- alphabetFrequency(myseq)[,c(1:4, 15)]
freq <- data.table(freq)
per <- NULL
for(i in 1:nrow(freq)){
  per[i] <- freq[i,5, with = FALSE]/(freq[i,1, with = FALSE]+
            freq[i,2, with = FALSE]+freq[i,3, with = FALSE]+
            freq[i,4, with = FALSE]+freq[i,5, with = FALSE])
}
per <- unlist(per)
freq$percentN <- per
freq$name <- names(myseq)
freq[1:22,1:6]

##            A        C        G        T        N     percentN
##  1: 67070277 48055043 48111528 67244164 18475410 0.0742114216
##  2: 38875926 27639505 27719976 39027555   534460 0.0039945463
##  3: 39286730 27903257 27981801 39361954   552880 0.0040927813
##  4: 39370109 27092804 27182678 39492225   137493 0.0010316465
##  5: 30047611 18839192 18933605 30162717 16381203 0.1432369978
##  6: 26673415 18423758 18559033 26911943 16475569 0.1539143941
##  7: 24508669 17752941 17825903 24553812 17349864 0.1701114005
##  8: 22558319 18172742 18299976 22774906  8532402 0.0944493947
##  9: 22639499 18723944 18851500 22705261   337237 0.0040505329
## 10: 24050680 15794455 16061651 24182819   283680 0.0035295310
## 11: 15142293 13954580 14061132 15282753   176858 0.0030171476
## 12: 71791213 48318180 48450903 71987932  1645301 0.0067933318
## 13: 17867246 13916133 14094472 18066406   499910 0.0077572575
## 14: 11820664  8185244  8226381 11856330  6621364 0.1417547936
## 15: 10382214  9160652  9246186 10370725 11658691 0.2294183878
## 16: 59689091 39233483 39344259 59833302   195424 0.0009855188
## 17: 58561236 36236976 36331025 58623430   461888 0.0024282474
## 18: 54699094 35731600 35879674 54955010   272881 0.0015031597
## 19: 51345477 33646690 33713330 51373025   727457 0.0042589668
## 20: 47058248 32317984 32378859 47215040   375842 0.0023586539
## 21: 43333530 29030173 29103787 43300646   370500 0.0025527317
## 22: 35736329 25099811 25170662 35783748 16604167 0.1199768847
##            A        C        G        T        N     percentN

Next, to automate this process, I used the systempipeR bioconductor package. This package initially requires the user to set up a text file indicating which files to align, which aligner and parameters to use, and what samples to calculate the fold changes from. This information is indicated below.

targets <- read.delim("targets.txt", comment.char ="#")
targets[,1:4]

##                                                       FileName SampleName      Factor  SampleLong
## 1              ./data/SRX2143034_SRR4183476_Mock_rep1.fastq.gz      MOCK1        MOCK      MOCK_1
## 2              ./data/SRX2143035_SRR4183477_Mock_rep2.fastq.gz      MOCK2        MOCK      MOCK_2
## 3     ./data/SRX2143036_SRR4183478_Mimic_control_rep1.fastq.gz CTRLMIMIC1   MIMICCTRL MIMICCTRL_1
## 4     ./data/SRX2143037_SRR4183479_Mimic_control_rep2.fastq.gz CTRLMIMIC2   MIMICCTRL MIMICCTRL_2
## 5             ./data/SRX2143038_SRR4183480_Mimic_rep1.fastq.gz     MIMIC1       MIMIC      MIMIC1
## 6             ./data/SRX2143039_SRR4183481_Mimic_rep2.fastq.gz     MIMIC2       MIMIC      MIMIC2
## 7             ./data/SRX2143040_SRR4183482_Mimic_rep3.fastq.gz     MIMIC3       MIMIC      MIMIC3
## 8    ./data/SRX2143041_SRR4183483_CTRL_inhibitor_rep1.fastq.gz CTRLINHIB1   CTRLINHIB CTRLINHIB_1
## 9    ./data/SRX2143042_SRR4183484_CTRL_inhibitor_rep2.fastq.gz CTRLINHIB2   CTRLINHIB CTRLINHIB_2
## 10 ./data/SRX2143043_SRR4183485_miR484_inhibitor_rep1.fastq.gz     INHIB1 MIR484INHIB     INHIB_1
## 11 ./data/SRX2143044_SRR4183486_miR484_inhibitor_rep2.fastq.gz     INHIB2 MIR484INHIB     INHIB_2
## 12 ./data/SRX2143045_SRR4183487_miR484_inhibitor_rep3.fastq.gz     INHIB3 MIR484INHIB     INHIB_3
## 13            ./data/SRX2143046_SRR4183488_no_tx_rep1.fastq.gz      NOTX1       NO_TX      NOTX_1
## 14            ./data/SRX2143047_SRR4183489_no_tx_rep2.fastq.gz      NOTX2       NO_TX      NOTX_2

Next, I loaded the Hisat2.param file

parampath <- system.file("extdata", "hisat2.param", package="systemPipeR")
read.delim(parampath, comment.char = "#")

##      PairSet         Name                                          Value
## 1    modules         <NA>                                   hisat2/2.0.1
## 2   software         <NA>                                         hisat2
## 3      cores           -p                                              7
## 4      other         <NA>   -k 1 --min-intronlen 30 --max-intronlen 3000
## 5   outfile1           -S                                    <FileName1>
## 6   outfile1         path                                     ./results/
## 7   outfile1       remove                                           <NA>
## 8   outfile1       append                                     .hisat.sam
## 9   outfile1 outextension                                     .hisat.bam
## 10 reference         <NA> ./data/Homo_sapiens.GRCh38.dna.chromosome.1.fa
## 11   infile1           -U                                    <FileName1>
## 12   infile1         path                                           <NA>
## 13   infile2         <NA>                                    <FileName2>
## 14   infile2         path                                           <NA>

The following information indicates the alignment parameters that were used.

args <- systemArgs(sysma=parampath, mytargets="targets.txt")
args

## An instance of 'SYSargs' for running 'hisat2' on 14 samples

modules(args)

## [1] "hisat2/2.0.1"

names(args)

##  [1] "targetsin"     "targetsout"    "targetsheader" "modules"       "software"      "cores"         "other"         "reference"     "results"       "infile1"       "infile2"       "outfile1"      "sysargs"       "outpaths"

sysargs(args)[1]

##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             MOCK1 
## "hisat2 -p 7 -k 1 --min-intronlen 30 --max-intronlen 3000 -S C:/Users/breng/Dropbox/Brendan Documents/Research/Research project/3 NCL manuscript/NCL final programming notes/results/SRX2143034_SRR4183476_Mock_rep1.fastq.gz.hisat.sam C:/Users/breng/Dropbox/Brendan Documents/Research/Research project/3 NCL manuscript/NCL final programming notes/data/Homo_sapiens.GRCh38.dna.chromosome.1.fa -U C:\\Users\\breng\\Dropbox\\Brendan Documents\\Research\\Research project\\3 NCL manuscript\\NCL final programming notes\\data\\SRX2143034_SRR4183476_Mock_rep1.fastq.gz "

Next, the reference genome was indexed.

system("hisat2-build ./data/genome.fasta ./data/genome.fasta")

In this section of code, the reads were aligned to the human genome (GRCh38) with hisat2.

bampaths <- runCommandline(args=args)
(read_statsDF <- alignStats(args=args))

##       FileName   Nreads   Nalign Perc_Aligned Nalign_Primary Perc_Aligned_Primary
##  1:      MOCK1 21759864 13689527     62.91182       13689527             62.91182
##  2:      MOCK2  6904259  3013524     43.64732        3013524             43.64732
##  3: CTRLMIMIC1 14856688  5879712     39.57620        5879712             39.57620
##  4: CTRLMIMIC2 14197084  8190945     57.69456        8190945             57.69456
##  5:     MIMIC1 20007656  8703905     43.50287        8703905             43.50287
##  6:     MIMIC2 21783252  8890267     40.81240        8890267             40.81240
##  7:     MIMIC3 21201661 11391097     53.72738       11391097             53.72738
##  8: CTRLINHIB1 22913477 17917546     78.19654       17917546             78.19654
##  9: CTRLINHIB2 16970400 13574453     79.98900       13574453             79.98900
## 10:     INHIB1 24332799 18899658     77.67153       18899658             77.67153
## 11:     INHIB2 19486865 10326784     52.99356       10326784             52.99356
## 12:     INHIB3 19487579 15189465     77.94434       15189465             77.94434
## 13:      NOTX1 21987954 13610564     61.90009       13610564             61.90009
## 14:      NOTX2 21812337 15880264     72.80405       15880264             72.80405

Once the reads were aligned to the genome, we determined which reads belong to gene bodies. To do this, we counted reads per Feature and Stored the Annotations in TranscriptDb.

txdb <- makeTxDbFromGFF(file="data/GTF.gtf.gz",
                        format = "gtf",
                        dataSource= "ensembl",
                        organism ="Homo sapiens")
saveDb(txdb, file="./data/GTF.sqlite")
txdb <- loadDb("./data/GTF.sqlite")
eByg <- exonsBy(txdb, by="gene")

Using the TranscriptDb information, we computed the raw number of reads per gene with summarizeOverlaps.

txdb <- loadDb("./data/GTF.sqlite")
eByg <- exonsBy(txdb, by=c("gene"))
bfl <- BamFileList(outpaths(args), yieldSize=50000, index=character())
multicoreParam <- MulticoreParam(workers=7); register(multicoreParam)
counteByg <- bplapply(bfl, function(x) summarizeOverlaps(eByg, x, mode="Union", ignore.strand=TRUE, 
                                                         inter.feature=TRUE, singleEnd=TRUE))
countDFeByg <- sapply(seq(along=counteByg), function(x) assays(counteByg[[x]])$counts)
rownames(countDFeByg) <- names(rowRanges(counteByg[[1]])); colnames(countDFeByg) <- names(bfl)
rpkmDFeByg <- apply(countDFeByg, 2, function(x) returnRPKM(counts=x, ranges=eByg))

print read counts

counteByg[,c(1,2,4,6,9,11,14)]

##                     V1 MOCK1 CTRLMIMIC1 MIMIC1 CTRLINHIB1 INHIB1 NOTX1
##     1: ENSG00000000003   363        134    271        457    555   497
##     2: ENSG00000000005     0          0      0          0      0     0
##     3: ENSG00000000419   451        220    285        614    679   415
##     4: ENSG00000000457    93         34     44        107    127    72
##     5: ENSG00000000460   152         32     43        166    147   129
##    ---                                                                
## 58298: ENSG00000284744     0          0      0          0      0     0
## 58299: ENSG00000284745     0          0      0          0      0     0
## 58300: ENSG00000284746     0          0      0          0      0     0
## 58301: ENSG00000284747     2          1      0          5      5     1
## 58302: ENSG00000284748     0          0      0          0      0     0

print RPKM normalized read counts

rpkmDFeByg[,c(1,2,4,6,9,11,14)]

##                     V1       MOCK1  CTRLMIMIC1     MIMIC1  CTRLINHIB1      INHIB1       NOTX1
##     1: ENSG00000000003  6.82658351  5.85178810  8.0976434  6.56985939  7.72815299  9.55858900
##     2: ENSG00000000005  0.00000000  0.00000000  0.0000000  0.00000000  0.00000000  0.00000000
##     3: ENSG00000000419 31.86715887 36.09744764 31.9966299 33.16486753 35.52405416 29.98855345
##     4: ENSG00000000457  1.15233633  0.97827788  0.8662466  1.01349865  1.16516108  0.91236668
##     5: ENSG00000000460  2.17250920  1.06207461  0.9765153  1.81371572  1.55568375  1.88559476
##    ---                                                                                       
## 58298: ENSG00000284744  0.00000000  0.00000000  0.0000000  0.00000000  0.00000000  0.00000000
## 58299: ENSG00000284745  0.00000000  0.00000000  0.0000000  0.00000000  0.00000000  0.00000000
## 58300: ENSG00000284746  0.00000000  0.00000000  0.0000000  0.00000000  0.00000000  0.00000000
## 58301: ENSG00000284747  0.04144085  0.04811558  0.0000000  0.07919756  0.07671047  0.02119041
## 58302: ENSG00000284748  0.00000000  0.00000000  0.0000000  0.00000000  0.00000000  0.00000000

Next, we conducted differential expressed gene analysis using edgeR.

countDF <- rpkmDFeByg
cmp <- readComp(args, format="matrix", delim="-")
edgeDF <- run_edgeR(countDF=countDF, targets=targetsin(args), cmp=cmp[[1]], independent=FALSE, mdsplot="")

##                     V1 MIMIC-MIR484INHIB_logFC MIMIC-MIR484INHIB_PValue MIMIC-MIR484INHIB_FDR
##     1: ENSG00000000003              0.30474373                0.4877807                     1
##     2: ENSG00000000005                      NA                       NA                    NA
##     3: ENSG00000000419              0.08857267                0.7247513                     1
##     4: ENSG00000000457             -0.26894805                0.8080476                     1
##     5: ENSG00000000460             -0.17241237                0.8612456                     1
##    ---                                                                                       
## 58298: ENSG00000284744                      NA                       NA                    NA
## 58299: ENSG00000284745                      NA                       NA                    NA
## 58300: ENSG00000284746                      NA                       NA                    NA
## 58301: ENSG00000284747                      NA                       NA                    NA
## 58302: ENSG00000284748                      NA                       NA                    NA

Finally, to annotate the gene names we downloaded the annotation information from biomart:

Values <- row.names(edgeDF)
listMarts() # To choose BioMart database
# myMart <- useMart("ENSEMBL_MART_ENSEMBL"); listDatasets(myMart)
myMart <- useMart("ENSEMBL_MART_ENSEMBL", dataset="hsapiens_gene_ensembl")
# listAttributes(myMart)[1:100,] # Choose data types you want to download
go <- getBM(attributes=c("ensembl_gene_id", "ensembl_transcript_id", "external_gene_name", 
                         "wikigene_name", "wikigene_description", "transcription_start_site", 
                         "strand", "go_id", "definition_1006"), mart=myMart, values = Values) 
go <- go[go[,8]!="",]; go[,8] <- as.character(go[,8])
go[,c(1,4,8)]

##          ensembl_gene_id wikigene_name      go_id
##       1: ENSG00000198888           ND1 GO:0016020
##       2: ENSG00000198888           ND1 GO:0016021
##       3: ENSG00000198888           ND1 GO:0016491
##       4: ENSG00000198888           ND1 GO:0055114
##       5: ENSG00000198888           ND1 GO:0005743
##      ---                                         
## 1043212: ENSG00000283951       KIR3DL2 GO:0050776
## 1043213: ENSG00000283951       KIR3DL2 GO:0005887
## 1043214: ENSG00000283951       KIR3DL2 GO:0006968
## 1043215: ENSG00000283882               GO:0016020
## 1043216: ENSG00000283882               GO:0016021

Once biomart annotations were downloaded, we merged this information with the computed fold changes to obtain a final miR-93 overexpression data set.

DEdata <- fread("./data/7 edgeRcomp.xls")
setnames(DEdata, c(colnames(DEdata)),
         c("ensembl_gene_id",colnames(DEdata)[2:length(DEdata)]))
setkey(DEdata,ensembl_gene_id)
head(DEdata)

BioGO <- fread("./data/GO/GO BioMart annotations.xls")
BioGO <- BioGO[,c(1:2,4:9), with = FALSE]
DEdata1 <- merge(DEdata, BioGO, by="ensembl_gene_id")
DEdata1 <- DEdata1[!DEdata1$wikigene_name =="",]
DEdata1 <- DEdata1[!grepl("Rik", wikigene_name)]
DEdata1 <- DEdata1[complete.cases(DEdata1),]
DEdata1[,c(1:2, 5, 8, 12)]

##         ensembl_gene_id MIMIC-MIR484INHIB_logFC MIMIC-MIR484INHIB_PValue wikigene_name      go_id
##      1: ENSG00000000003               0.3047437                0.4877807        TSPAN6 GO:0016020
##      2: ENSG00000000003               0.3047437                0.4877807        TSPAN6 GO:0016021
##      3: ENSG00000000003               0.3047437                0.4877807        TSPAN6 GO:0004871
##      4: ENSG00000000003               0.3047437                0.4877807        TSPAN6 GO:0005887
##      5: ENSG00000000003               0.3047437                0.4877807        TSPAN6 GO:0005515
##     ---                                                                                          
## 639893: ENSG00000284308              -0.2009451                0.7769564       C2orf81           
## 639894: ENSG00000284585              -2.6470819                0.1071606       MIR4722           
## 639895: ENSG00000284693              -0.9663936                0.5291754  LOC100506022           
## 639896: ENSG00000284693              -0.9663936                0.5291754  LOC100506022           
## 639897: ENSG00000284693              -0.9663936                0.5291754  LOC100506022

Then, we merged the data together with the raw read counts.

rawdata <- fread("./data/5 countDFeByg.xls")
setnames(rawdata, c(colnames(rawdata)),
         c("ensembl_gene_id",colnames(rawdata)[2:length(rawdata)]))
rawmerge <- merge(DEdata, rawdata, by="ensembl_gene_id")
rawmerge <- rawmerge[,c("ensembl_gene_id", "ensembl_transcript_id", "wikigene_name", 
                        "wikigene_description", colnames(rawmerge)[!(colnames(rawmerge) %in% 
                      c("ensembl_gene_id", "ensembl_transcript_id", "wikigene_name", 
                        "wikigene_description", "transcription_start_site", "strand", 
                        "go_id", "definition_1006"))] ), with = FALSE]

Example print out of the first columns of data table:

rawmerge[,c(1, 3, 5, 8)]

##        ensembl_gene_id wikigene_name MIMIC-MIR484INHIB_logFC MIMIC-MIR484INHIB_PValue
##     1: ENSG00000000003        TSPAN6              0.30474373                0.4877807
##     2: ENSG00000000419          DPM1              0.08857267                0.7247513
##     3: ENSG00000000457         SCYL3             -0.26894805                0.8080476
##     4: ENSG00000000460      C1orf112             -0.17241237                0.8612456
##     5: ENSG00000000938           FGR              1.32518418                0.3540597
##    ---                                                                               
## 12779: ENSG00000284032        MIR29A              0.23317201                0.8497746
## 12780: ENSG00000284154       MIR3605             -1.43389604                0.3596510
## 12781: ENSG00000284308       C2orf81             -0.20094508                0.7769564
## 12782: ENSG00000284585       MIR4722             -2.64708194                0.1071606
## 12783: ENSG00000284693  LOC100506022             -0.96639357                0.5291754

Eample print out of the last columns of data table:

rawmerge[,c(10:23)]

##        MOCK1 CTRLMIMIC1 MIMIC1 CTRLINHIB1 INHIB1 NOTX1
##     1:   363        134    271        457    555   497
##     2:   451        220    285        614    679   415
##     3:    93         34     44        107    127    72
##     4:   152         32     43        166    147   129
##     5:     9          5     14          5      7     5
##    ---                                                
## 12779:     0          1      0          0      0     0
## 12780:     0          0      0          1      1     0
## 12781:   162         51     52        218    132   114
## 12782:     0          0      0          0      0     1
## 12783:    13          9      2         19     20    26

Download miR-484 overexpression RNAseq from GEO

In this section, we obtained GEO data of miR-484 overexpression in HUVECs. This data was gnerated using an affymetrix microarray and can be downloaded using the GEO2R workflow. The data is available here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE66844 and the code to download it is illustrated below.

#### load series and platform data from GEO
gset <- getGEO("GSE66844", GSEMatrix =TRUE, AnnotGPL=FALSE)
if (length(gset) > 1) idx <- grep("GPL15207", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]
#### make proper column names to match toptable 
fvarLabels(gset) <- make.names(fvarLabels(gset))
# group names for all samples
gsms <- "101010"
sml <- c()
for (i in 1:nchar(gsms)) { sml[i] <- substr(gsms,i,i) }
#### subset out desiredsamples (eliminate samples marked as "X") ####
sel <- which(sml != "X")
sml <- sml[sel]
gset <- gset[ ,sel]
#### log2 transform
ex <- exprs(gset)
#### Differential expression analysis with limma ####
qx <- as.numeric(quantile(ex, c(0., 0.25, 0.5, 0.75, 0.99, 1.0), na.rm=T))
LogC <- (qx[5] > 100) ||
  (qx[6]-qx[1] > 50 && qx[2] > 0) ||
  (qx[2] > 0 && qx[2] < 1 && qx[4] > 1 && qx[4] < 2)
if (LogC) { ex[which(ex <= 0)] <- NaN
exprs(gset) <- log2(ex) }
#### set up the data and proceed with analysis with limma ####
sml <- paste("G", sml, sep="")    # set group names
fl <- as.factor(sml)
gset$description <- fl
design <- model.matrix(~ description + 0, gset)
colnames(design) <- levels(fl)
fit <- lmFit(gset, design)
cont.matrix <- makeContrasts(G1-G0, levels=design)
fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2, 0.01)
tT <- topTable(fit2, adjust="fdr", sort.by="B", number=15000000)

The data was then cleaned and transformed into a data table.

#### Subset data, remove entries with no gene name, and transform into a data.table ####
sub <- tT[,c(1:ncol(tT))]
head(sub)
sub <- as.data.table(sub)
sub2 <- sub[,c(1:ncol(sub)), with = FALSE]
sub2 <- sub2[!sub2$Gene.Symbol == "",]
sub2 <- sub2[!sub2$Gene.Symbol == "---",]
sub2[,c(1,15,18,25,28)]

##                   ID Gene.Symbol         Ensembl         logFC     P.Value
##     1:   11760322_at     ZSCAN18             --- -4.714000e-01 0.005471052
##     2: 11750788_a_at     CSNK1G1 ENSG00000169118  4.928433e-01 0.008587143
##     3: 11739326_a_at       PTPN2 ENSG00000175354  4.904500e-01 0.008630547
##     4:   11759792_at      SH3YL1 ENSG00000035115 -4.000500e-01 0.009816660
##     5: 11756563_a_at      RCBTB2 ENSG00000136161 -5.084000e-01 0.009973088
##    ---                                                                    
## 48803: 11750654_s_at       ITPR1 ENSG00000150995  3.333333e-05 0.999976212
## 48804:   11729898_at     C5orf54 ENSG00000221886 -6.666667e-06 0.999984979
## 48805: 11719501_a_at      NFATC3 ENSG00000072736  6.666667e-06 0.999985909
## 48806: 11757538_a_at        GOT1 ENSG00000120053 -1.000000e-05 0.999987493
## 48807: 11756174_s_at         SCD ENSG00000099194  0.000000e+00 1.000000000

Annotate shear RNAseq data with miR-94 and miR-484 predicted targets

In this section, we will combine the in silico miR-93 and miR-484 target prediction with OS/PS and miR-93 or miR-484 RNAseq data. To accomplish this, we will first annotate the OS/PS RNAseq data with predicted targets of miR-93, miR-484, or both miR-93 and miR-484. This information will then be compared to miR-93 or miR-484 overexpression data in the following sections. first miR-target keys were created:

mirtarg <- MiRNAhitstot #fread("./results/27 miRNA target hits HMR conserved multiple species.xls")
mir935p <- mirtarg[mirtarg$Targeting_Factor == "MIR-93-5P",]
mir935p <- unique(mir935p[mir935p$Species == "Homo_sapiens",][,c(5,10), with = FALSE])
mir933p <- mirtarg[mirtarg$Targeting_Factor == "MIR-93-3P",]
mir933p <- unique(mir933p[mir933p$Species == "Homo_sapiens",][,c(5,10), with = FALSE])
mir484 <- mirtarg[mirtarg$Targeting_Factor == "MIR-484",]
mir484 <- unique(mir484[mir484$Species == "Homo_sapiens",][,c(5,10), with = FALSE])
key3p <- mir933p$gene_symbol; key5p <- mir935p$gene_symbol
key484 <- mir484$gene_symbol
p3 <- setdiff(key3p, key5p); p3 <- setdiff(p3, key484)
p5 <- setdiff(key5p, key3p); p5 <- setdiff(p5, key484)
m484 <- setdiff(key484, key5p); m484 <- setdiff(key484, key3p)
p3p5 <- intersect(key3p, key5p); p3p5 <- setdiff(p3p5, key484)
p3m484 <- intersect(key3p, key484); p3m484 <- setdiff(p3m484, key5p)
p5m484<- intersect(key5p, key484); p5m484 <- setdiff(p5m484, key3p)
p3p5m484 <- intersect(key5p, key484); p5m484 <- intersect(p5m484, key3p)

next, these keys were used to annotate the PS/OS RNAseq dataset.
Load the PSOS RNAseq dataset

PSOS <- fread("./data/5 PSOS FC data.xls")
PSOS <- PSOS[!duplicated(PSOS$GeneName),][,c(2,21,22)]
setnames(PSOS, colnames(PSOS), c("Gene.symbol", "PS24-OS24_logFC", "PS24-OS24_PValue"))
DT_fin <- PSOS
DT_fin

##         Gene.symbol PS24-OS24_logFC PS24-OS24_PValue
##     1:          A2M     0.209494340        0.9153554
##     2:        ACADM     0.085649521        0.8633036
##     3:        ACADS    -0.043946288        0.9498366
##     4:       ACADVL     0.368829343        0.4124650
##     5:        ACAT1     0.197154902        0.4887729
##    ---                                              
## 13116: EEF1E1-MUTED    -0.006299207        0.9868662
## 13117:    LOC645513     0.049879873        0.9547314
## 13118: LOC100288911    -0.010454722        0.9821938
## 13119: LOC100527964     0.362486676        0.7770170
## 13120:  TMX2-CTNND1     0.383143195        0.7912074

Finally, we annotated the PSOS RNAseq dataset with miR-93 and miR-484 predicted targets.

intgene <- DT_fin[DT_fin$`PS24-OS24_PValue` < 0.99,]
p3gene <- intgene[(intgene$Gene.symbol %in% p3),]; p3gene$classification <- "miR-93-3p"
p5gene <- intgene[(intgene$Gene.symbol %in% p5),]; p5gene$classification <- "miR-93-5p"
m484gene <- intgene[(intgene$Gene.symbol %in% m484),]; m484gene$classification <- "miR-484"
p3p5gene <- intgene[(intgene$Gene.symbol %in% p3p5),]; p3p5gene$classification <- "miR-93-3p & miR-93-5p"
p3m484gene <- intgene[(intgene$Gene.symbol %in% p3m484),]; p3m484gene$classification <- "miR-93-3p & miR-484"
p5m484gene <- intgene[(intgene$Gene.symbol %in% p5m484),]; p5m484gene$classification <- "miR-93-5p & miR-484"
p3p5m484gene <- intgene[(intgene$Gene.symbol %in% p3p5m484),]
p3p5m484gene$classification <- "miR-93-3p & miR-93-5p & miR-484"
targs <- rbind(p3gene, p5gene, m484gene, p3p5gene, p3m484gene, p5m484gene, p3p5m484gene)
DT_not <- DT_fin[!(DT_fin$Gene.symbol %in% targs$Gene.symbol),];DT_not$classification <- "Not targeted"
Final <- rbind(targs, DT_not)
Final

##         Gene.symbol PS24-OS24_logFC PS24-OS24_PValue classification
##     1:        PSEN1     0.024171231       0.97050643      miR-93-3p
##     2:          AGA    -0.214242712       0.70493347      miR-93-3p
##     3:       BCKDHB    -0.535026027       0.59976335      miR-93-3p
##     4:         CDK4    -0.531227924       0.01165379      miR-93-3p
##     5:          FAH    -0.503621322       0.23975900      miR-93-3p
##    ---                                                             
## 13648: EEF1E1-MUTED    -0.006299207       0.98686616   Not targeted
## 13649:    LOC645513     0.049879873       0.95473142   Not targeted
## 13650: LOC100288911    -0.010454722       0.98219385   Not targeted
## 13651: LOC100527964     0.362486676       0.77701702   Not targeted
## 13652:  TMX2-CTNND1     0.383143195       0.79120740   Not targeted

Explore miR-484 putative targets

Once the PS/OS datasets were annotated with the predicted miR-93 and miR-484 targets, they were merged with the miR-484 overexpression data.
First we subsetted both the miR-484 fold change data and shear data table (labeled PSOS in the code) to include only targets identified in both data tables. Then the two data tables were merged together.

array484 <- sub2[, c(15, 25, 28)]
array484 <- array484[,.SD[which.min(logFC)],by=list(Gene.Symbol)]
array484 <- array484[(array484$Gene.Symbol %in% Final$Gene.symbol),]
setnames(array484, colnames(array484), c("Gene.symbol", "logFC", "P.Value"))
setnames(PSOS, colnames(PSOS), c("Gene.symbol", "PS24-OS24_logFC", "PS24-OS24_PValue"))
finDT <- merge(PSOS, array484, by = "Gene.symbol")
finDT$`PS24-OS24_logFC` <- finDT$`PS24-OS24_logFC` * -1
setnames(finDT, colnames(finDT), c("Gene.symbol", "OS24-PS24_logFC", "OS24-PS24_PValue", "logFC", "P.Value"))
finDT

##        Gene.symbol OS24-PS24_logFC OS24-PS24_PValue        logFC   P.Value
##     1:        A1BG    -0.043704994       0.97806237 -0.234666667 0.2455293
##     2:       A2LD1    -0.006677226       0.99360186  0.210973333 0.7649909
##     3:         A2M    -0.209494340       0.91535537 -0.235710000 0.6905679
##     4:      A4GALT    -0.939078919       0.03600162 -0.067770000 0.8731722
##     5:        AAAS     0.055997313       0.95732985 -0.010286667 0.9287602
##    ---                                                                    
## 12035:      ZYG11A     0.803083617       0.67979591 -0.240126667 0.2449006
## 12036:      ZYG11B    -0.221037393       0.73607048 -0.089140000 0.7005601
## 12037:         ZYX     0.270695732       0.09659020 -0.160966667 0.4223581
## 12038:       ZZEF1    -0.143581918       0.71563471 -0.008093333 0.9428464
## 12039:        ZZZ3     0.112509648       0.82455565 -0.220350000 0.3292576

Next, we annotated the resulting data table according to if they were predicted targets of miR-484 or both miR-93 and miR-484.

finDT$target <- "NULL"
targs484 <- Final[Final$classification == "miR-484" | 
                  Final$classification == "miR-93-3p & miR-484" |
                  Final$classification == "miR-93-5p & miR-484" | 
                  Final$classification == "miR-93-3p & miR-93-5p & miR-484",]
finDT[(finDT$Gene.symbol %in% targs484$Gene.symbol),]$target <- "consensus match"
fintarget <- finDT[finDT$`OS24-PS24_logFC` < 0 & finDT$logFC < 0,]
fintarget$classification <- "miR484 target"
notarget <- finDT[!(finDT$`OS24-PS24_logFC` < 0 & finDT$logFC < 0),]
notarget$classification <- "no target"
finDT <- rbind(fintarget, notarget)
finDT2 <- finDT[!duplicated(finDT$Gene.symbol),]
finDT2

##        Gene.symbol OS24-PS24_logFC OS24-PS24_PValue       logFC   P.Value          target classification
##     1:        A1BG     -0.04370499       0.97806237 -0.23466667 0.2455293 consensus match  miR484 target
##     2:         A2M     -0.20949434       0.91535537 -0.23571000 0.6905679 consensus match  miR484 target
##     3:      A4GALT     -0.93907892       0.03600162 -0.06777000 0.8731722            NULL  miR484 target
##     4:        AARS     -0.00616604       0.98052221 -0.01750667 0.9276419 consensus match  miR484 target
##     5:        AATK     -0.09304133       0.95853392 -0.03175000 0.9519064 consensus match  miR484 target
##    ---                                                                                                  
## 12035:       ZWINT      0.07040942       0.95754742 -0.10663333 0.4989971            NULL      no target
## 12036:        ZXDB      0.03332474       0.96848205  0.18894333 0.7786269            NULL      no target
## 12037:      ZYG11A      0.80308362       0.67979591 -0.24012667 0.2449006 consensus match      no target
## 12038:         ZYX      0.27069573       0.09659020 -0.16096667 0.4223581 consensus match      no target
## 12039:        ZZZ3      0.11250965       0.82455565 -0.22035000 0.3292576            NULL      no target

Finally, we generated a plot to visualize results. The point highlighted in red is eNOS.

ggplot(finDT2, aes(`OS24-PS24_logFC`, logFC)) + #, colour = group
  geom_point(shape = 21, size = 3, colour = "black", fill = "gray"
  ) + 
  theme_pubr() +
  labs(x = "OS/PS log2(FC)", y = "miR-484 overexpression log2(FC)") +
  scale_color_manual(values=c("#E31A1C")) +
  geom_point(data = subset(finDT, classification == "miR484 target"),
             aes(x=`OS24-PS24_logFC`, y=logFC, colour = classification), 
             shape = 21, size = 3, colour = "black", fill = "#08519C") +
  geom_point(data = subset(finDT, Gene.symbol == "NOS3"),
             aes(x=`OS24-PS24_logFC`, y=logFC), shape = 21, size = 3, colour = "black", fill = "#E31A1C")

Explore miR-93 putative targets

Once the PS/OS datasets were annotated with the predicted miR-93 and miR-484 targets, they were merged with the miR-93 overexpression data.
First we subsetted both the miR-93 fold change data and shear data table (labeled PSOS in the code) to include only targets identified in both data tables. Then the two data tables were merged together.

miR93 <- rawmerge[,c(3, 5, 8)] 
setnames(miR93, colnames(miR93), c("Gene.symbol", "logFC", "P.Value"))
setnames(PSOS, colnames(PSOS), c("Gene.symbol", "PS24-OS24_logFC", "PS24-OS24_PValue"))
finDT <- merge(PSOS, miR93, by = "Gene.symbol")
finDT$`PS24-OS24_logFC` <- finDT$`PS24-OS24_logFC` * -1
setnames(finDT, colnames(finDT), c("Gene.symbol", "OS24-PS24_logFC", "OS24-PS24_PValue", "logFC", "P.Value"))
finDT

##        Gene.symbol OS24-PS24_logFC OS24-PS24_PValue       logFC   P.Value
##     1:      A4GALT     -0.93907892       0.03600162 -0.18787262 0.8128323
##     2:        AAAS      0.05599731       0.95732985 -0.02209726 0.9551142
##     3:        AACS      0.04103272       0.93041781 -0.13603877 0.8438917
##     4:       AAGAB      0.16755632       0.63033882 -0.04423256 0.9062805
##     5:        AAK1      0.07969348       0.89380589 -0.05550181 0.9178856
##    ---                                                                   
## 10449:      ZYG11A      0.80308362       0.67979591 -0.22762969 0.8642075
## 10450:      ZYG11B     -0.22103739       0.73607048 -0.13578960 0.7038547
## 10451:         ZYX      0.27069573       0.09659020  0.27357950 0.3256346
## 10452:       ZZEF1     -0.14358192       0.71563471 -0.25029822 0.6948852
## 10453:        ZZZ3      0.11250965       0.82455565  0.13418417 0.7326937

Next, we annotated the resulting data table according to if they were predicted targets of miR-93 or both miR-93 and miR-484.

finDT$target <- "NULL"
p35targs <- Final[Final$classification == "miR-93-5p" | 
                  Final$classification == "miR-93-3p" | 
                  Final$classification == "miR-93-3p & miR-93-5p" | 
                  Final$classification == "miR-93-3p & miR-484" | 
                  Final$classification == "miR-93-5p & miR-484" | 
                  Final$classification == "miR-93-3p & miR-93-5p & miR-484",]
finDT[(finDT$Gene.symbol %in% p35targs$Gene.symbol),]$target <- "consensus match"
fintarget <- finDT[finDT$`OS24-PS24_logFC` < 0 & finDT$logFC < 0,]
fintarget$classification <- "miR93 target"
notarget <- finDT[!(finDT$`OS24-PS24_logFC` < 0 & finDT$logFC < 0),]
notarget$classification <- "no target"
finDT <- rbind(fintarget, notarget)
finDT2 <- finDT[!duplicated(finDT$Gene.symbol),]
finDT2

##        Gene.symbol OS24-PS24_logFC OS24-PS24_PValue       logFC    P.Value          target classification
##     1:      A4GALT     -0.93907892       0.03600162 -0.18787262 0.81283228            NULL   miR93 target
##     2:        AARS     -0.00616604       0.98052221 -0.07430367 0.77795499 consensus match   miR93 target
##     3:       ABCA1     -0.66064342       0.71612309 -0.16097283 0.91175318 consensus match   miR93 target
##     4:       ABCA2     -1.21810660       0.06897123 -1.06257427 0.13471243 consensus match   miR93 target
##     5:       ABCA3     -0.69067000       0.37125665 -0.64628215 0.05635528 consensus match   miR93 target
##    ---                                                                                                   
## 10440:        ZXDA     -0.07751632       0.95563444  0.18839355 0.85671216            NULL      no target
## 10441:        ZXDB      0.03332474       0.96848205  0.01381112 0.98091095            NULL      no target
## 10442:      ZYG11A      0.80308362       0.67979591 -0.22762969 0.86420745 consensus match      no target
## 10443:         ZYX      0.27069573       0.09659020  0.27357950 0.32563462 consensus match      no target
## 10444:        ZZZ3      0.11250965       0.82455565  0.13418417 0.73269368            NULL      no target

Finally, we generated a plot to visualize results. The points highlighted in red are eNOS and KLF2.

ggplot(finDT2, aes(`OS24-PS24_logFC`, logFC)) + #, colour = group
  geom_point(shape = 21, size = 3, colour = "black", fill = "gray") + 
  theme_pubr() +
  labs(x = "OS/PS log2(FC)", y = "miR-484 overexpression log2(FC)") +
  scale_color_manual(values=c("#E31A1C")) + 
  geom_point(data = subset(finDT, classification == "miR93 target"),
             aes(x=`OS24-PS24_logFC`, y=logFC, colour = classification), 
             shape = 21, size = 3, colour = "black", fill = "#08519C") +
  geom_point(data = subset(finDT, Gene.symbol == "KLF2"),
             aes(x=`OS24-PS24_logFC`, y=logFC), 
             shape = 21, size = 3, colour = "black", fill = "#E31A1C") +
  geom_point(data = subset(finDT, Gene.symbol == "NOS3"),
             aes(x=`OS24-PS24_logFC`, y=logFC), 
             shape = 21, size = 3, colour = "black", fill = "#E31A1C")

Section III: Identify additional miRNAs NCL may regulate

In this section, we mined the predicted NCL regulated miRs and expression profiles of miRs under PS or OS and to find miRs that are both differentially regulated by shear stress in the endothelium and may require NCL for maturation. This was done in two steps. In the first step, miR expression profiles were downloaded from Gene Expression Omnibus. In the second step, predicted NCL regulated miRs were annotated with the expression profiles and only miRs that were significantly differentially expressed were retained.

Download shear regulated miR expression from GEO

This portion of code queries and downloads Gene Expression Omnibibus for miRs that are differentially regulated by shear stress from the GEO accession number GSE26953. The results of this code are illustrated below.

# load series and platform data from GEO
gset <- getGEO("GSE26953", GSEMatrix =TRUE, AnnotGPL=FALSE)
if (length(gset) > 1) idx <- grep("GPL8179", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]
# make proper column names to match toptable 
fvarLabels(gset) <- make.names(fvarLabels(gset))
# group names for all samples
gsms <- "000000XXXXXX111111XXXXXX"
sml <- c()
for (i in 1:nchar(gsms)) { sml[i] <- substr(gsms,i,i) }
# eliminate samples marked as "X"
sel <- which(sml != "X")
sml <- sml[sel]
gset <- gset[ ,sel]
# log2 transform
ex <- exprs(gset)
qx <- as.numeric(quantile(ex, c(0., 0.25, 0.5, 0.75, 0.99, 1.0), na.rm=T))
LogC <- (qx[5] > 100) ||
  (qx[6]-qx[1] > 50 && qx[2] > 0) ||
  (qx[2] > 0 && qx[2] < 1 && qx[4] > 1 && qx[4] < 2)
if (LogC) { ex[which(ex <= 0)] <- NaN
exprs(gset) <- log2(ex) }
# set up the data and proceed with analysis
sml <- paste("G", sml, sep="")    # set group names
fl <- as.factor(sml)
gset$description <- fl
design <- model.matrix(~ description + 0, gset)
colnames(design) <- levels(fl)
fit <- lmFit(gset, design)
cont.matrix <- makeContrasts(G1-G0, levels=design)
fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2, 0.01)
tT <- topTable(fit2, adjust="fdr", sort.by="B", number=2500000000000)

FC <- subset(tT, select=c("ID","adj.P.Val","P.Value","t","B","logFC","miRNA_ID"))
FC

##                 ID  adj.P.Val      P.Value            t          B         logFC                    miRNA_ID
##    1: ILMN_3168684 0.08693411 8.218184e-05  5.895432269  1.3595844  0.7083580037              hsa-miR-26a-1*
##    2: ILMN_3168858 0.08693411 2.004817e-04 -5.322242670  0.6875645 -0.8370094056                hsa-miR-23b*
##    3: ILMN_3167240 0.08693411 2.277750e-04 -5.242563976  0.5893538 -1.2645767136              hsa-miR-486-5p
##    4: ILMN_3168616 0.34886620 1.563259e-03  4.099644770 -0.9449975  0.9039965897              hsa-miR-92a-1*
##    5: ILMN_3168681 0.34886620 1.721573e-03  4.044872832 -1.0240262  0.4359113156               hsa-miR-106b*
##   ---                                                                                                       
## 1141: ILMN_3168591 0.99445704 9.909830e-01  0.011546184 -5.7536583  0.0027169817                 hsa-miR-93*
## 1142: ILMN_3168528 0.99474274 9.921364e-01  0.010069117 -5.7536739  0.0006708245                 hsa-miR-562
## 1143: ILMN_3168499 0.99527851 9.935400e-01  0.008271783 -5.7536900  0.0007504537 hsa-miR-518a-5p,hsa-miR-527
## 1144: ILMN_3168756 0.99541842 9.945491e-01  0.006979741 -5.7536997  0.0004231339                hsa-miR-144*
## 1145: ILMN_3168800 0.99719223 9.971922e-01  0.003595222 -5.7537172  0.0025484069                hsa-miR-379*

Annotate predicted NCL regulated miRs with shear Fold changes

In this section, shear differentially regulated miR data and predicted NCL regulated miRs were used to determine which miRs that respond to shear may be regulated by NCL. In the first steps of this process, we cleaned the shear regulated miRs data table. This was done in two steps. First the miR names were cleaned.

FC <- fread("./data/36 miR fold changes.xls")
FC <- FC[,c("miRNA_ID","logFC", "P.Value", "adj.P.Val")]
#### reformat miR names
FC$miRNA_ID <- toupper(FC$miRNA_ID)
FC$miRNA_ID <- gsub("HSA-", "", FC$miRNA_ID, fixed = TRUE)
FC$miRNA_ID <- gsub("*", "", FC$miRNA_ID, fixed = TRUE)
FC$miRNA_ID <- gsub("-5P", "", FC$miRNA_ID, fixed = TRUE)
FC$miRNA_ID <- gsub("-3P", "", FC$miRNA_ID, fixed = TRUE)
FC <- FC[!FC$miRNA_ID == "",]
setnames(FC, colnames(FC), c("gene_symbol", "OS/PS_logFC", "P.Value", "adj.P.Val"))
FC

##           gene_symbol   OS/PS_logFC      P.Value  adj.P.Val
##   1:        MIR-26A-1  0.7083580037 8.218184e-05 0.08693411
##   2:          MIR-23B -0.8370094056 2.004817e-04 0.08693411
##   3:          MIR-486 -1.2645767136 2.277750e-04 0.08693411
##   4:        MIR-92A-1  0.9039965897 1.563259e-03 0.34886620
##   5:         MIR-106B  0.4359113156 1.721573e-03 0.34886620
##  ---                                                       
## 854:           MIR-93  0.0027169817 9.909830e-01 0.99445704
## 855:          MIR-562  0.0006708245 9.921364e-01 0.99474274
## 856: MIR-518A,MIR-527  0.0007504537 9.935400e-01 0.99527851
## 857:          MIR-144  0.0004231339 9.945491e-01 0.99541842
## 858:          MIR-379  0.0025484069 9.971922e-01 0.99719223

Then records annotated with two miR names were divided into their own records, each containing the miR name and corresponding information in the data table.

# adjust duplicated names 
DT <- NULL
names <- FC$gene_symbol
for(i in 1:length(names)){
  spl <- strsplit(names[i], split = ",")
  bound <- cbind(data.table(spl[[1]]), FC[i,2:ncol(FC)])
  setnames(bound, colnames(bound), c("gene_symbol", "OS/PS_logFC", "P.Value", "adj.P.Val"))
  DT <- rbind(DT, bound)
  
}
FC <- DT
FC

##      gene_symbol   OS/PS_logFC      P.Value  adj.P.Val
##   1:   MIR-26A-1  0.7083580037 8.218184e-05 0.08693411
##   2:     MIR-23B -0.8370094056 2.004817e-04 0.08693411
##   3:     MIR-486 -1.2645767136 2.277750e-04 0.08693411
##   4:   MIR-92A-1  0.9039965897 1.563259e-03 0.34886620
##   5:    MIR-106B  0.4359113156 1.721573e-03 0.34886620
##  ---                                                  
## 874:     MIR-562  0.0006708245 9.921364e-01 0.99474274
## 875:    MIR-518A  0.0007504537 9.935400e-01 0.99527851
## 876:     MIR-527  0.0007504537 9.935400e-01 0.99527851
## 877:     MIR-144  0.0004231339 9.945491e-01 0.99541842
## 878:     MIR-379  0.0025484069 9.971922e-01 0.99719223

After the miR fold change data table was cleaned, NCL was subsetted from the predicted miR binding protein data table.

NCL <- AMPK_targs_PSOS[AMPK_targs_PSOS$Species == "Human",c(1,6,10:12)]
NCL <- NCL[NCL$Targeting_Factor == "NCL",]

Finally, to determine which miRs that respond to shear NCL may regulate, miR fold changes were annotated with predicted NCL regulated miRs. In this new data table, miRs were selected with a p-value less than 0.05.

NCL_FC <- merge(FC, NCL, by = "gene_symbol", allow.cartesian = TRUE)
NCL_FC <- NCL_FC[!duplicated(NCL_FC$gene_symbol),]
NCL_FC <- NCL_FC[NCL_FC$P.Value < 0.05,]
NCL_FC

##     gene_symbol OS/PS_logFC      P.Value  adj.P.Val Targeting_Factor         Gene_ID Identified_sequence Species
##  1:      LET-7E  -0.6437136 0.0105411895 0.73652265              NCL ENSG00000115053                GguA   Human
##  2:    MIR-106B   0.4359113 0.0017215729 0.34886620              NCL ENSG00000115053                AguG   Human
##  3:     MIR-107   0.2875544 0.0447074587 0.97698082              NCL ENSG00000115053                AguG   Human
##  4:    MIR-133B  -0.2877096 0.0210871293 0.89425048              NCL ENSG00000115053                GguC   Human
##  5:     MIR-155   0.7752894 0.0341110361 0.97642841              NCL ENSG00000115053                UguU   Human
##  6:     MIR-187   1.1581162 0.0018281198 0.34886620              NCL ENSG00000115053                GguC   Human
##  7:     MIR-23B  -0.8370094 0.0002004817 0.08693411              NCL ENSG00000115053                GguG   Human
##  8:      MIR-25   0.4601918 0.0320489675 0.96825357              NCL ENSG00000115053                AguG   Human
##  9:     MIR-27B  -0.6752616 0.0067557295 0.68732332              NCL ENSG00000115053                GguG   Human
## 10:     MIR-30A  -0.2532950 0.0061748468 0.68732332              NCL ENSG00000115053                UguA   Human
## 11:     MIR-363  -0.2063610 0.0482834035 0.97698082              NCL ENSG00000115053                UguU   Human
## 12:     MIR-377  -0.1887817 0.0189578268 0.89425048              NCL ENSG00000115053                GguU   Human
## 13:     MIR-452   0.6215602 0.0187447464 0.89425048              NCL ENSG00000115053                UguU   Human
## 14:     MIR-505  -0.6893031 0.0276553744 0.96825357              NCL ENSG00000115053                AguG   Human
## 15:     MIR-582   0.5588665 0.0244112128 0.96382202              NCL ENSG00000115053                UguG   Human
## 16:     MIR-877   0.3985299 0.0067659924 0.68732332              NCL ENSG00000115053                 guA   Human
## 17:   MIR-92A-1   0.9039966 0.0015632590 0.34886620              NCL ENSG00000115053                GguU   Human

Section IV: Analyze serum levels of miR-93-3p, miR-93-5p, and miR-484

In this section, the serum levels of miR-93-5p, miR-93-3p, and miR-484 were determined in people with coronary artery disease (CAD) or healthy control individuals (CTRL). This process was done in several steps as putlined below.

Load data tables

cohort1 <- fread("./data/2019-1-12 combined taiwan data WSC 0113 update.csv")
cohort1 <- cohort1[,c(3:7)]
head(cohort1)

##    mir39_CT mir93-3p_CT mir93-5p_CT mir484_CT group
## 1: 24.57745    33.88671    30.31115  31.77536  CTRL
## 2: 27.41441    34.92678    34.30077  32.35014  CTRL
## 3: 25.23070    33.75857    30.92392  32.05500  CTRL
## 4: 23.60611    31.35246    27.55444  29.50591  CTRL
## 5: 22.56708    31.24483    26.91515  30.20167  CTRL
## 6: 24.37048    34.29940    29.35328  31.85543  CTRL

cohort2 <- fread("./data/2018-12-12 xian data.csv")
setnames(cohort2, colnames(cohort2), c("Sample Name", "mir39_CT", "mir93-3p_CT", "mir93-5p_CT", "mir484_CT",  "group"))
cohort2 <- cohort2[,c(2:6)]
head(cohort2)

##    mir39_CT mir93-3p_CT mir93-5p_CT mir484_CT group
## 1:    21.10       33.76       31.63     28.21  CTRL
## 2:    20.97       35.19       31.59     27.93  CTRL
## 3:    20.12       34.55       31.38     28.13  CTRL
## 4:    19.75       34.76       31.68     28.94  CTRL
## 5:    21.14       33.40       31.45     28.04  CTRL
## 6:    20.51       32.58       29.72     26.87  CTRL

calculate delta CT values for each miRNA relative to cel-miR-39 spike in control

In order to “normalize” fold changes to the reference gene, we used the delta CT method.

cohort1$delta_ct_93_3p <- 2^(cohort1$mir39_CT - cohort1$`mir93-3p_CT`)
cohort1$delta_ct_93_5p <- 2^(cohort1$mir39_CT - cohort1$`mir93-5p_CT`)
cohort1$delta_ct_484 <- 2^(cohort1$mir39_CT - cohort1$mir484_CT)
cohort1 <- cohort1[,c(1:4, 6:8, 5)]
head(cohort1)

##    mir39_CT mir93-3p_CT mir93-5p_CT mir484_CT delta_ct_93_3p delta_ct_93_5p delta_ct_484 group
## 1: 24.57745    33.88671    30.31115  31.77536    0.001576274    0.018792462  0.006811006  CTRL
## 2: 27.41441    34.92678    34.30077  32.35014    0.005477117    0.008452767  0.032673609  CTRL
## 3: 25.23070    33.75857    30.92392  32.05500    0.002709284    0.019327266  0.008824273  CTRL
## 4: 23.60611    31.35246    27.55444  29.50591    0.004657115    0.064779157  0.016748741  CTRL
## 5: 22.56708    31.24483    26.91515  30.20167    0.002441945    0.049102262  0.005032229  CTRL
## 6: 24.37048    34.29940    29.35328  31.85543    0.001025875    0.031624673  0.005582182  CTRL

cohort2$delta_ct_93_3p <- 2^(cohort2$mir39_CT - cohort2$`mir93-3p_CT`)
cohort2$delta_ct_93_5p <- 2^(cohort2$mir39_CT - cohort2$`mir93-5p_CT`)
cohort2$delta_ct_484 <- 2^(cohort2$mir39_CT - cohort2$mir484_CT)
cohort2 <- cohort2[,c(1:4, 6:8, 5)]
head(cohort2)

##    mir39_CT mir93-3p_CT mir93-5p_CT mir484_CT delta_ct_93_3p delta_ct_93_5p delta_ct_484 group
## 1:    21.10       33.76       31.63     28.21   1.545113e-04   0.0006763230  0.007238969  CTRL
## 2:    20.97       35.19       31.59     27.93   5.240268e-05   0.0006354208  0.008032139  CTRL
## 3:    20.12       34.55       31.38     28.13   4.530406e-05   0.0004077578  0.003879268  CTRL
## 4:    19.75       34.76       31.68     28.94   3.030678e-05   0.0002562785  0.001712121  CTRL
## 5:    21.14       33.40       31.45     28.04   2.038789e-04   0.0007877361  0.008373230  CTRL
## 6:    20.51       32.58       29.72     26.87   2.325776e-04   0.0016885493  0.012174447  CTRL

calculate fold changes

Fold Changes were calculated based on the average value of the control sample for each miR. This was done in two steps. In the first step, the average delta CT value was calculated for each miR in the control samples. In the second step, the fold changes were computed based on this average value.

ctrl <- cohort1[cohort1$group == "CTRL",]
ctrl <- ctrl[complete.cases(ctrl)]
ave3p <- ave(ctrl$delta_ct_93_3p)[1]
ave5p <- ave(ctrl$delta_ct_93_5p)[1]
ave484 <- ave(ctrl$delta_ct_484)[1]
#### compute fold changes ####
cohort1$FC3p <- cohort1$delta_ct_93_3p/ave3p
cohort1$FC5p <- cohort1$delta_ct_93_5p/ave5p
cohort1$FC484 <- cohort1$delta_ct_484/ave484
cohort1 <- cohort1[,c(5:7, 9:11, 8)]
head(cohort1)

##    delta_ct_93_3p delta_ct_93_5p delta_ct_484      FC3p      FC5p     FC484 group
## 1:    0.001576274    0.018792462  0.006811006 0.6458062 1.8639916 0.3888060  CTRL
## 2:    0.005477117    0.008452767  0.032673609 2.2439975 0.8384152 1.8651718  CTRL
## 3:    0.002709284    0.019327266  0.008824273 1.1100050 1.9170378 0.5037333  CTRL
## 4:    0.004657115    0.064779157  0.016748741 1.9080393 6.4253316 0.9561013  CTRL
## 5:    0.002441945    0.049102262  0.005032229 1.0004748 4.8703677 0.2872646  CTRL
## 6:    0.001025875    0.031624673  0.005582182 0.4203054 3.1367962 0.3186586  CTRL

ctrl <- cohort2[cohort2$group == "CTRL",]
ctrl <- ctrl[complete.cases(ctrl)]
ave3p <- ave(ctrl$delta_ct_93_3p)[1]
ave5p <- ave(ctrl$delta_ct_93_5p)[1]
ave484 <- ave(ctrl$delta_ct_484)[1]
#### compute fold changes ####
cohort2$FC3p <- cohort2$delta_ct_93_3p/ave3p
cohort2$FC5p <- cohort2$delta_ct_93_5p/ave5p
cohort2$FC484 <- cohort2$delta_ct_484/ave484
cohort2 <- cohort2[,c(5:7, 9:11, 8)]
head(cohort2)

##    delta_ct_93_3p delta_ct_93_5p delta_ct_484      FC3p      FC5p     FC484 group
## 1:   1.545113e-04   0.0006763230  0.007238969 1.9620825 0.9071049 1.3854664  CTRL
## 2:   5.240268e-05   0.0006354208  0.008032139 0.6654424 0.8522456 1.5372713  CTRL
## 3:   4.530406e-05   0.0004077578  0.003879268 0.5752997 0.5468971 0.7424531  CTRL
## 4:   3.030678e-05   0.0002562785  0.001712121 0.3848547 0.3437285 0.3276828  CTRL
## 5:   2.038789e-04   0.0007877361  0.008373230 2.5889834 1.0565355 1.6025527  CTRL
## 6:   2.325776e-04   0.0016885493  0.012174447 2.9534183 2.2647334 2.3300675  CTRL

take the log2 of the FC values

This data was not normally distributed. Therefore, we calculated the log2 of the fold changes to create a more “normal” distribution of the data.

cohort1$log2FC_93_3p <- log2(cohort1$FC3p)
cohort1$log2FC_93_5p <- log2(cohort1$FC5p)
cohort1$log2FC_484 <- log2(cohort1$FC484)
cohort1 <- cohort1[,c(1:6,8:10, 7)]
head(cohort1)

##    delta_ct_93_3p delta_ct_93_5p delta_ct_484      FC3p      FC5p     FC484  log2FC_93_3p log2FC_93_5p  log2FC_484 group
## 1:    0.001576274    0.018792462  0.006811006 0.6458062 1.8639916 0.3888060 -0.6308268124    0.8983953 -1.36287756  CTRL
## 2:    0.005477117    0.008452767  0.032673609 2.2439975 0.8384152 1.8651718  1.1660710776   -0.2542631  0.89930849  CTRL
## 3:    0.002709284    0.019327266  0.008824273 1.1100050 1.9170378 0.5037333  0.1505662376    0.9388788 -0.98926802  CTRL
## 4:    0.004657115    0.064779157  0.016748741 1.9080393 6.4253316 0.9561013  0.9320909076    2.6837709 -0.06476469  CTRL
## 5:    0.002441945    0.049102262  0.005032229 1.0004748 4.8703677 0.2872646  0.0006848776    2.2840307 -1.79954787  CTRL
## 6:    0.001025875    0.031624673  0.005582182 0.4203054 3.1367962 0.3186586 -1.2504900424    1.6492918 -1.64991636  CTRL

cohort2$log2FC_93_3p <- log2(cohort2$FC3p)
cohort2$log2FC_93_5p <- log2(cohort2$FC5p)
cohort2$log2FC_484 <- log2(cohort2$FC484)
cohort2 <- cohort2[,c(1:6,8:10, 7)]
head(cohort2)

##    delta_ct_93_3p delta_ct_93_5p delta_ct_484      FC3p      FC5p     FC484 log2FC_93_3p log2FC_93_5p log2FC_484 group
## 1:   1.545113e-04   0.0006763230  0.007238969 1.9620825 0.9071049 1.3854664    0.9723857  -0.14065877  0.4703718  CTRL
## 2:   5.240268e-05   0.0006354208  0.008032139 0.6654424 0.8522456 1.5372713   -0.5876143  -0.23065877  0.6203718  CTRL
## 3:   4.530406e-05   0.0004077578  0.003879268 0.5752997 0.5468971 0.7424531   -0.7976143  -0.87065877 -0.4296282  CTRL
## 4:   3.030678e-05   0.0002562785  0.001712121 0.3848547 0.3437285 0.3276828   -1.3776143  -1.54065877 -1.6096282  CTRL
## 5:   2.038789e-04   0.0007877361  0.008373230 2.5889834 1.0565355 1.6025527    1.3723857   0.07934123  0.6803718  CTRL
## 6:   2.325776e-04   0.0016885493  0.012174447 2.9534183 2.2647334 2.3300675    1.5623857   1.17934123  1.2203718  CTRL

melt Data table to make a graph

Graphs for cohort 1:

DT2 <- cohort1[,c(7:10)]
DT_melt <- melt(DT2, id.vars = c("group"),measure.vars = c("log2FC_93_3p", "log2FC_93_5p", "log2FC_484"))
DT_melt <- DT_melt[DT_melt$value < 100,]

p1 <- ggplot(DT_melt, aes(x= `variable`, y= `value`, fill = group)) +
  geom_dotplot(binaxis='y', stackdir='center', 
               position=position_dodge(0.8), binwidth = 0.2) + 
  ylab("Fold change") +
  xlab("microRNA") +
  ggtitle("cohort1") +
  theme_pubr()

p2 <- ggplot(DT_melt, aes(x= `variable`, y= `value`, fill = group)) +
  geom_boxplot(position=position_dodge(0.8)) +
  ylab("Fold change") +
  xlab("microRNA") +
  ggtitle("cohort1") +
  theme_pubr()

p1;p2

Graphs for cohort 2:

DT2 <- cohort2[,c(7:10)]
DT_melt <- melt(DT2, id.vars = c("group"),measure.vars = c("log2FC_93_3p", "log2FC_93_5p", "log2FC_484"))
DT_melt <- DT_melt[DT_melt$value < 100,]

p1 <- ggplot(DT_melt, aes(x= `variable`, y= `value`, fill = group)) +
  geom_dotplot(binaxis='y', stackdir='center', 
               position=position_dodge(0.8), binwidth = 0.2) + 
  ylab("Fold change") +
  xlab("microRNA") +
  ggtitle("cohort2") +
  theme_pubr()

p2 <- ggplot(DT_melt, aes(x= `variable`, y= `value`, fill = group)) +
  geom_boxplot(position=position_dodge(0.8)) +
  ylab("Fold change") +
  xlab("microRNA") +
  ggtitle("cohort2") +
  theme_pubr()

p1;p2

Statistical analysis

In this section, we compared the control to the treatment groupd for each miR. Because the data was not normally distributed, we used Kruskal-Wallis test followed by a Mann-Whitney U non-parametric tests.

Stats for cohort 1:

CTRL <- cohort1[cohort1$group == "CTRL",]
CAD <- cohort1[cohort1$group == "CAD",]
wilcox.test(CTRL$log2FC_93_3p ,CAD$log2FC_93_3p, conf.int = TRUE, 
            paired = FALSE, formula = "lhs")

## 
##  Wilcoxon rank sum test
## 
## data:  CTRL$log2FC_93_3p and CAD$log2FC_93_3p
## W = 933, p-value = 0.4489
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  -1.3456787  0.5814148
## sample estimates:
## difference in location 
##             -0.3927225

wilcox.test(CTRL$log2FC_93_5p ,CAD$log2FC_93_5p, conf.int = TRUE, 
            paired = FALSE, formula = "lhs")

## 
##  Wilcoxon rank sum test
## 
## data:  CTRL$log2FC_93_5p and CAD$log2FC_93_5p
## W = 591, p-value = 0.0002325
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  -3.1296677 -0.9367714
## sample estimates:
## difference in location 
##              -2.134306

wilcox.test(CTRL$log2FC_484 ,CAD$log2FC_484, conf.int = TRUE, 
            paired = FALSE, formula = "lhs")

## 
##  Wilcoxon rank sum test
## 
## data:  CTRL$log2FC_484 and CAD$log2FC_484
## W = 561, p-value = 8.39e-05
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  -3.671774 -1.307429
## sample estimates:
## difference in location 
##              -2.697787

stats for cohort 2:

CTRL <- cohort2[cohort2$group == "CTRL",]
CAD <- cohort2[cohort2$group == "CAD",]
wilcox.test(CTRL$log2FC_93_3p ,CAD$log2FC_93_3p, conf.int = TRUE, 
            paired = FALSE, formula = "lhs")

## 
##  Wilcoxon rank sum test
## 
## data:  CTRL$log2FC_93_3p and CAD$log2FC_93_3p
## W = 296, p-value = 0.005525
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  -1.87 -0.32
## sample estimates:
## difference in location 
##                  -1.12

wilcox.test(CTRL$log2FC_93_5p ,CAD$log2FC_93_5p, conf.int = TRUE, 
            paired = FALSE, formula = "lhs")

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  CTRL$log2FC_93_5p and CAD$log2FC_93_5p
## W = 140, p-value = 1.021e-06
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  -2.50006 -1.17003
## sample estimates:
## difference in location 
##              -1.914086

wilcox.test(CTRL$log2FC_484 ,CAD$log2FC_484, conf.int = TRUE, 
            paired = FALSE, formula = "lhs")

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  CTRL$log2FC_484 and CAD$log2FC_484
## W = 186.5, p-value = 2.155e-05
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  -2.0199425 -0.8300589
## sample estimates:
## difference in location 
##              -1.376852

combine data together and calculate the correlation between miRs

cohort1$dataset <- "cohort1"
cohort2$dataset <- "cohort2"
combined <- rbind(cohort1, cohort2)
combined2 <- combined[,7:11]
combined_comp <- combined2[complete.cases(combined2),]
head(combined[,c(7:11)])

##     log2FC_93_3p log2FC_93_5p  log2FC_484 group dataset
## 1: -0.6308268124    0.8983953 -1.36287756  CTRL cohort1
## 2:  1.1660710776   -0.2542631  0.89930849  CTRL cohort1
## 3:  0.1505662376    0.9388788 -0.98926802  CTRL cohort1
## 4:  0.9320909076    2.6837709 -0.06476469  CTRL cohort1
## 5:  0.0006848776    2.2840307 -1.79954787  CTRL cohort1
## 6: -1.2504900424    1.6492918 -1.64991636  CTRL cohort1

p <- ggplot(combined, aes(x=log2FC_93_3p, y=log2FC_93_5p))+
  geom_point(shape = 21, size = 3, colour = "black", fill = "#08519C")+
  scale_color_manual(values=c("#E31A1C")) +
  geom_point(data = subset(combined, combined$dataset == "cohort2"),
             aes(x=log2FC_93_3p, y=log2FC_93_5p), 
             shape = 21, size = 3, colour = "black", fill = "#E31A1C") +
  geom_smooth(method = "lm")+
  ggtitle("miR-93-5p miR-93-3p correlation")+
  theme_pubr(); p

cor.test(combined_comp$log2FC_93_3p, combined_comp$log2FC_93_5p, method=c("pearson"))

## 
##  Pearson's product-moment correlation
## 
## data:  combined_comp$log2FC_93_3p and combined_comp$log2FC_93_5p
## t = 11.198, df = 152, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5753789 0.7506865
## sample estimates:
##       cor 
## 0.6723538

p <- ggplot(data = combined, mapping = aes(x=combined$log2FC_93_3p, y=combined$log2FC_484))+
  geom_point(shape = 21, size = 3, colour = "black", fill = "#08519C")+
  geom_smooth(method = "lm")+
  geom_point(data = subset(combined, combined$dataset == "cohort2"),
             aes(x=log2FC_93_3p, y=log2FC_484), 
             shape = 21, size = 3, colour = "black", fill = "#E31A1C") +
  ggtitle("miR-484 miR-93-3p correlation")+
  theme_pubr(); p

cor.test(combined_comp$log2FC_93_3p, combined_comp$log2FC_484, method=c("pearson"))

## 
##  Pearson's product-moment correlation
## 
## data:  combined_comp$log2FC_93_3p and combined_comp$log2FC_484
## t = 12.693, df = 152, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6307238 0.7862817
## sample estimates:
##      cor 
## 0.717327

p <- ggplot(data = combined, mapping = aes(x=log2FC_93_5p, y=log2FC_484))+
  geom_point(shape = 21, size = 3, colour = "black", fill = "#08519C")+
  geom_point(data = subset(combined, combined$dataset == "cohort2"),
             aes(x=log2FC_93_5p, y=log2FC_484), 
             shape = 21, size = 3, colour = "black", fill = "#E31A1C") +
  geom_smooth(method = "lm")+
  ggtitle("miR-484 miR-93-5p correlation")+
  theme_pubr(); p

cor.test(combined_comp$log2FC_93_5p, combined_comp$log2FC_484, method=c("pearson"))

## 
##  Pearson's product-moment correlation
## 
## data:  combined_comp$log2FC_93_5p and combined_comp$log2FC_484
## t = 14.749, df = 152, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6932165 0.8252700
## sample estimates:
##       cor 
## 0.7672556

References

Gongol B, Marin T, Peng IC, Woo B, Martin M, King S, Sun W, Johnson DA, Chien S, Shyy JY (2013) AMPK??2 exerts its anti-inflammatory effects through PARP-1 and Bcl-6. Proc Natl Acad Sci U S A 110:3161-6. https://www.ncbi.nlm.nih.gov/pubmed/23382195
Shang F, Zhang J, Li Z, Zhang J, Yin Y, Wang Y, Marin TL, Gongol B, Xiao H, Zhang YY, Chen Z, Shyy JY, Lei T (2016) Cardiovascular Protective Effect of Metformin and Telmisartan: Reduction of PARP1 Activity via the AMPK-PARP1 Cascade. PLoS One 11:e0151845. https://www.ncbi.nlm.nih.gov/pubmed/26986624
Shentu TP, He M, Sun X, Zhang J, Zhang F, Gongol B, Marin TL, Zhang J, Wen L, Wang Y, Geary GG, Zhu Y, Johnson DA, Shyy JY (2016) AMP-Activated Protein Kinase and Sirtuin 1 Coregulation of Cortactin Contributes to Endothelial Function. Arterioscler Thromb Vasc Biol 36:2358-2368. https://www.ncbi.nlm.nih.gov/pubmed/27758765
Marin TL, Gongol B, Zhang F, Martin M, Johnson DA, Xiao H, Wang Y, Subramaniam S, Chien S, Shyy JY (2017) AMPK promotes mitochondrial biogenesis and function by phosphorylating the epigenetic factors DNMT1, RBBP7, and HAT1. Sci Signal 31;10(464). https://www.ncbi.nlm.nih.gov/pubmed/28143904
Pickering BF, Yu D, Van Dyke MW (2011) Nucleolin protein interacts with microprocessor complex to affect biogenesis of microRNAs 15a and 16. J Biol Chem 286:44095-103. https://www.ncbi.nlm.nih.gov/pubmed/22049078
Ghisolfi-Nieto L, Joseph G, Puvion-Dutilleul F, Amalric F, Bouvet P (1996) Nucleolin is a sequence-specific RNA-binding protein: characterization of targets on pre-ribosomal RNA. J Mol Biol 260:34-53. https://www.ncbi.nlm.nih.gov/pubmed/8676391
Gongol B, Marin T, Johnson DA, Shyy JY (2018) Bioinformatics Approach to Identify Novel AMPK Targets. Methods Mol Biol 1732:99-109. https://www.ncbi.nlm.nih.gov/pubmed/29480471
Marin TL, Gongol B, Martin M, King SJ, Smith L, Johnson DA, Subramaniam S, Chien S, Shyy JY (2015) Identification of AMP-activated protein kinase targets by a consensus sequence search of the proteome. BMC Syst Biol 11;9:13. https://www.ncbi.nlm.nih.gov/pubmed/25890336
Ajami NE, Gupta S, Maurya MR, Nguyen P, Li JY, Shyy JY, Chen Z, Chien S, Subramaniam S (2017) Systems biology analysis of longitudinal functional response of endothelial cells to shear stress. Proc Natl Acad Sci USA 114:10990-10995. https://www.ncbi.nlm.nih.gov/pubmed/28973892
Schirle NT, Sheu-Gruttadauria J, MacRae IJ. (2014) Structural basis for microRNA targeting. Science. 346:608-13. https://www.ncbi.nlm.nih.gov/pubmed/25359968
Back to top

miRNA interacting protein analysis

Brendan Gongol, Traci Marin

Janurary 9, 2019

Table of contents