This documentation provides a detailed description of the functions and workthrough used for the bioinformatic prediction of microRNA (miR) binding proteins that may regulate the maturation of miRs in the endothelium in response to Pulsatile (PS) or Oscillatory (OS) shear stress. This analysis accompanies the publication titled “Shear Stress Regulation of miR-93 and miR-484 Maturation through Nucleolin” and reproduces graphs and tables contained in the publication.
This analysis is divided into four sections: Section I involves identifying RNA binding proteins that potentially regulate the maturation of microRNAs (miR) in the endothelium as well as which ones of them are regulated by AMP activated protein kinase (AMPK), a well established mediator or energy homeostasis and mechano-signaling pathway that prevents endothelial dysfunction (1-4). This analysis identifies NCL as an important regulator of miR-93 and miR-484 biogenesis. Section II involves identifying endothelial transcripts that are targeted by miRs regulated by miR binding proteins identified in section I. In this analysis, we foculed on nucleolin (NCL) and its regulation of miR-93 and miR-484. Therefore, in this section, we identified targets of both miR-93 and miR-484, establishing them as miRs that are governed by shear stress and confer endothelial dycfunction by targeting KLF2 and eNOS, two markers of endothelial homeostasis. Section III involves mining online databases for the identification of miRs NCL putatively regulates. Finally, Section IV details the steps and statistical methods used in the validation of miR-93 and miR-484 in the serum of people with coronray artery disease (CAD) compared to healthy controls (CTRL).
In this section, all libraries used in the analysis are loaded and the working directory is set (excluded from view). Additionally, the organism and BioMartdataset objects are assigned which are used later in the analysis to query BioMart for annotation information. Functions for this analysis can be found at the following github repository: https://github.com/brengong/ConservationtextmineR
devtools::install_github('brengong/ConservationtextmineR')
#### load packages ####
suppressMessages(library(data.table))
suppressMessages(library(tidyverse))
suppressMessages(library(readxl))
suppressMessages(library(ggpubr))
suppressMessages(library(dplyr))
suppressMessages(library(dtplyr))
suppressMessages(library(purrr))
suppressMessages(library(stringr))
suppressMessages(library(gridExtra))
suppressMessages(library(caret))
suppressMessages(library(plotly))
suppressMessages(library(biomaRt))
suppressMessages(library(systemPipeR))
suppressMessages(library(GenomicFeatures))
suppressMessages(library(BiocParallel))
suppressMessages(library(ConservationtextmineR))
#### set up parameters ####
Organ <- "Homo sapiens"
BioMartdataset <-"hsapiens_gene_ensembl"
In this section, we used data obtained from the A daTabase of RNA binding proteins and AssoCiated moTifs (ATtRACT) data base (https://attract.cnic.es/) to obtain a comprehensive list of RNA binding proteins and their cognate binding motifs. The link to download the data is located at: https://attract.cnic.es/attract/static/download/download_db/dd088b2b-8dd3-4e31-8e73-e1b866983a1e.txt. This information is ultimately used to determine which RNA binding proteins can bind to miR sequences. In addition to this database. NCL is an additional RNA binding protein identified based on literature searches (5-6) and was included in the ultimate table. Both the direct and indirect targeting information for NCL was included.
DT <- fread("./data/1 RRM protein database.csv")
DTsub <- DT[,c("Gene ID", "Gene Name", "Organism", "Motif")]
DTsub <- DTsub[DTsub$Organism == "Homo_sapiens",]
DTsub <- DTsub[!duplicated(DTsub$`Gene Name`),]
DTsub <- unique(DTsub,by = c('Gene Name','Motif'))
DTsub$`Gene Name` <- gsub("[*]", "", DTsub$`Gene Name`)
setnames(DTsub, c("Gene ID", "Gene Name", "Organism", "Motif"),
c("Gene_ID", "Targeting_Factor", "Organism", "Consensus_Sequence"))
Gene_ID <- c("ENSG00000115053", "ENSG00000115053" )
Targeting_Factor <- c("NCL", "NCL"); Organism <- c("Homo_sapiens", "Homo_sapiens")
Consensus_Sequence <- c("UCCCGA", "GU")
NCL <- as.data.table(cbind(Gene_ID, Targeting_Factor, Organism, Consensus_Sequence))
DTsub <- rbind(DTsub, NCL); rm(NCL); rm(Consensus_Sequence); rm(Gene_ID); rm(Organism); rm(Targeting_Factor)
DTsub
## Gene_ID Targeting_Factor Organism Consensus_Sequence
## 1: ENSG00000136450 SRSF1 Homo_sapiens CAGACAA
## 2: ENSG00000161547* SRSF2 Homo_sapiens GAAUCCCG
## 3: ENSG00000104824 HNRNPL Homo_sapiens CACACCACA
## 4: ENSG00000136231 IGF2BP3 Homo_sapiens AAACUCA
## 5: ENSG00000143889 HNRNPLL Homo_sapiens ACAUACA
## ---
## 176: ENSG00000160710 ADAR Homo_sapiens GCGC
## 177: ENSG00000104413 ESRP1 Homo_sapiens AGGGAU
## 178: ENSG00000108771 DHX58 Homo_sapiens GCGCG
## 179: ENSG00000115053 NCL Homo_sapiens UCCCGA
## 180: ENSG00000115053 NCL Homo_sapiens GU
To obtain miR sequences to query RNA binding protein sequence binding information against, both mature and immature miR sequences were downloaded from miRbase. The link to download both the mature and immature miR sequence information is located at: http://www.mirbase.org/ftp.shtml. Once the fasta files are downloaded and saved in the working directory, the seqCompile function reads these fasta files and converts them into a data table containing the sequence information, the miR name, and if the miRNA sequence corresponds to the immature or mature sequence. In this section of code, the results of the first 6 mature miRNA sequences are printed.
RefSeqHPmiRNA <- seqCompile(files= "./data/2018-12-18 hairpin.fa.gz",
direct = getwd(), type = "miRNA", miRNA_type = "IMMATURE_HAIR_PIN" )
RefSeqIDMatMiRNA <- seqCompile(files= "./data/2018-12-18 mature.fa.gz",
direct = getwd(), type = "miRNA", miRNA_type = "MATURE" )
MiRNADT <- rbind(RefSeqHPmiRNA, RefSeqIDMatMiRNA)
tail(MiRNADT)
## Sequence miRNA_Name miRNA_type
## 1: GAGGAUGCUGAUCAUUCACUGG smc-miR-12461-5p MATURE
## 2: AGUAAAUGAUCAGCAUCCUCCA smc-miR-12461-3p MATURE
## 3: UUCUGCUCCUAUUUAAGUCAAU gga-miR-1784b-5p MATURE
## 4: UGAUUUCAAUAAGAGCAGAAUU gga-miR-1784b-3p MATURE
## 5: AGCAUAGAAUGUCAGAUCUAG mdo-miR-7385g-3p MATURE
## 6: AGCAUAGAAUGUCAGGUCUAG mdo-miR-7385g-2-3p MATURE
Next, a system of tracking the same miRs in different species is needed. To accomplish this, MiRNASpeciesAnnot reads the three letter species designation located in the first three letters of the miR name and annotates the data table with the common name and scientific name of the species the miR sequence belongs to. Next, in order to track the same miR in different species, MiRNAname strips the three letter species designation from the miR name and adds the species deidentified miR name to a new column labeled “miRNA”.
MiRNADT <- MiRNASpeciesAnnot(MiRNADT)
MiRNADT <- MiRNAname(MiRNADT)
head(MiRNADT)[,2:ncol(MiRNADT)]
## miRNA_Name miRNA_type Common_Name Scientific_Name miRNA
## 1: CEL-LET-7 IMMATURE_HAIR_PIN roundworm Caenorhabditis_elegans LET-7
## 2: CEL-LIN-4 IMMATURE_HAIR_PIN roundworm Caenorhabditis_elegans LIN-LIN
## 3: CEL-MIR-1 IMMATURE_HAIR_PIN roundworm Caenorhabditis_elegans MIR-1
## 4: CEL-MIR-2 IMMATURE_HAIR_PIN roundworm Caenorhabditis_elegans MIR-2
## 5: CEL-MIR-34 IMMATURE_HAIR_PIN roundworm Caenorhabditis_elegans MIR-34
## 6: CEL-MIR-35 IMMATURE_HAIR_PIN roundworm Caenorhabditis_elegans MIR-35
In this section, out goal is to use the miR data table (designated miRNADT in the code above) and the RNA binding protein data tables downloaded, compiled, and cleaned in the previous sections, to map the binding sequences of all RNA binding proteins onto miR sequences for every species. In this analysis, we are interested in predicting which RNA binding proteins can facilitate the maturation of miRs. Threfore, we used only mature miR sequences in this analysis. The TFpredict function matches sequences from one data table to binding sequences from a second data table. It then compiles a new data table containing the binding sequence (consensus sequence) used, the start and stop position on the querey sequence (miR sequence) for which the binding site was identified, the gene ID for the RNA binding protein, the RNA binding protein queried (Targeting Factor), the species the RNA binding prrotein is from (Organism), the length of the queried sequence (length), the sequence queried with the binding site presented in lower case (Sequence), the identified sequence with the flanking residue (Identified Sequence), the name of the miR the sequence belongs to (gene_symbol), the common name (Species) and scientific name (Scientific Name), of the miR sequence.
#### Perform the transcription factor query computations on immature miRNA hairpins
MiRNADTim <- MiRNADT[MiRNADT$miRNA_type == "IMMATURE_HAIR_PIN",]
MiRNADTim$Sequence <- as.character(MiRNADTim$Sequence)
setnames(MiRNADTim, c("Sequence", "miRNA_Name", "miRNA_type", "Common_Name", "Scientific_Name", "miRNA"),
c("Sequence", "miRNA_Name", "miRNA_type", "Common_Name", "Scientific_Name", "gene_symbol"))
TX_TOT_Species <- TFpredict(Target = MiRNADTim, Targeting_Factor_DT = DTsub, type = "multiple_species")
Output of the resulting data table (excluding the sequences, gene IDs, and the species for the RBP)
TX_TOT_Species[,c(1:4,6,8, 10:11)]
## Consensus_Sequence start end Number_Hits Targeting_Factor length Identified_sequence gene_symbol
## 1: AGAAC 78 82 3 DAZAP1 182 UagaacA MIR-319
## 2: AGAAC 78 82 3 YBX1 182 UagaacA MIR-319
## 3: AGAAC 78 82 3 HNRNPA2B1 182 UagaacA MIR-319
## 4: AGAAC 78 82 3 DAZAP1 182 UagaacA MIR-319
## 5: AGAAC 78 82 3 YBX1 182 UagaacA MIR-319
## ---
## 647466: GUGGU 133 137 2 ZNF346 247 AgugguG MIR-11970
## 647467: UCAC 175 178 1 NOVA1 247 AucacG MIR-11970
## 647468: UUGUU 98 102 1 CELF2 247 UuuguuA MIR-11970
## 647469: UUUAUA 112 117 1 PPIE 247 UuuuauaA MIR-11970
## 647470: UUUUA 111 115 1 CPEB1 247 AuuuuaU MIR-11970
Next, because we the biological sciences use rodent models to validate bioinformatic predictions, we were interested in retaining only predicted associations conserved at a minimum between Human, mouse, and rat. To accomplish this, SpeciesTFCons uses a character vector containing species that are required for the predicted interaction to be present in. It then retains only miR RNA binding protein interactions conserved at a minimum between the species required (in this case, human, mouse, and rat). If the interaction is conserved between the minimum designated species, all other species the predicted interaction is observed in are retained. The number of records prior to and after running this filter are illustrated below.
DT25 <- SpeciesTFCons(DT = TX_TOT_Species,Spec = c("Human", "Mouse", "Rat"), provide = "TF_Target")
nrow(TX_TOT_Species)
nrow(DT25)
## [1] 647470
## [1] 72274
In this section, RNA binding proteins that are predicted AMPK targets were identified. All other RNA binding proteins were discarded. To accomplish this, AMPK targets previously identified were used. R codes we previously wrote that generate a predicted AMPK target database was used for this portion of the analysis (7-8). The identification of AMPK targets was conducted in two steps: in the first step, the AMPK target database was loaded. The number of records prior to and after running this filter are illustrated below.
keys <- fread("./data/99 AMPK target keys.xls")
keys
## Targeting_Factor
## 1: HNF4G
## 2: TRIM7
## 3: TRIM41
## 4: SLC2A10
## 5: SLC13A3
## ---
## 13415: PCDHGB5
## 13416: PCDHB9
## 13417: H2AFB3
## 13418: RIMBP3
## 13419: PRAMEF25
In the second step, the merge function was used to retain only RNA binding proteins that are putative AMPK targets. The number of targets prior to and after this filter are executes are illustrated below.
AMPK_targs <- merge(DT25, keys, by = "Targeting_Factor")
AMPK_targs[,c(1:5, 8, 11:12)]
## Targeting_Factor Consensus_Sequence start end Number_Hits length gene_symbol Species
## 1: ACO1 CAGUGG 23 28 1 105 MIR-140 Lizard
## 2: ACO1 CAGUGG 52 57 1 85 MIR-194-1 Lizard
## 3: ACO1 CAGUGG 54 59 1 87 MIR-194-2 Lizard
## 4: ACO1 CAGUGG 64 69 1 101 MIR-216A Lizard
## 5: ACO1 CAGUGG 52 57 1 73 MIR-27A Lizard
## ---
## 67929: ZFP36 UCUUCU 68 73 1 85 MIR-877 Rat
## 67930: ZFP36 UCUUCU 11 16 1 110 MIR-10A White-lipped tamarin
## 67931: ZFP36 UCUUCU 92 97 1 150 MIR-10A Tasmanian devil
## 67932: ZFP36 UCUUCU 88 93 1 93 MIR-10A Schmidtea mediterranea
## 67933: ZFP36 UCUUCU 25 30 1 75 MIR-494 Wild boar
nrow(DT25)
## [1] 72274
nrow(AMPK_targs)
## [1] 67933
In this step, the expression levels of identified miR binding proteins that are also putative AMPK targets in endothelial cells were analysed. Any proteins that were not expressed in the endothelium were discarded. To accomplish this we used a data source containing RNA expression profiles of endothelial cells subjected to pulsatile or oscillatory shear stress (9).
PSOS <- fread("./data/5 PSOS FC data.xls")
PSOS <- PSOS[!duplicated(PSOS$GeneName),][,c(2,33,34, 53, 54, 43, 44,63,64)]
setnames(PSOS, colnames(PSOS),
c("Gene.symbol", "setA_OS_20_hr", "setA_OS_24_hr", "setB_OS_20_hr", "setB_OS_24_hr",
"setA_PS_20_hr", "setA_PS_24_hr", "setB_PS_20_hr", "setB_PS_24_hr"))
AMPK_targs_PSOS <- AMPK_targs[(AMPK_targs$Targeting_Factor %in% PSOS$Gene.symbol)]
A final list of 66 RNA binding proteins was identified for further analysis.
length(AMPK_targs_PSOS[!duplicated(AMPK_targs_PSOS$Targeting_Factor),]$Targeting_Factor)
## [1] 66
At this point of the analysis, a data table of RNA binding proteins and their putative miR binding matches across species evolution has been generated. These putative miR binding proteins are expressed in endothelial cells, predicted to be phosphorylated by AMPK, and the predicted interaction and the predicted AMPK phosphorylation of the RNA binding protein are both conserved at a minimum between human, mouse, and rat. In this section, the number of miRs each RNA binding protein is predicted to regulate and the species conservation of the binding site will be explored. To do this, a function called TFRankR was created. This function takes the data table obtained in the previous section and ranks proteins based on the number of proteins they target, the number of species containing each target interaction, the number of hits between the RNA binding protein and miR, or any combination of these ranking classifications. For this analysis, we determined the number of immature miRNAs each RNA binding protein targets in human.
TX_PROM <- TFRankR(DT = AMPK_targs_PSOS, sortBy = "Target", dec = TRUE)
TX_PROM <- TX_PROM[TX_PROM$Species == "Human",]
head(TX_PROM)
## Targeting_Factor Number_targets Species
## 1: NCL 237 Human
## 2: NXF1 78 Human
## 3: NOVA1 65 Human
## 4: NUDT21 65 Human
## 5: PIWIL1 60 Human
## 6: F2 55 Human
Then a Plot of the number of miRs each RNA binding protein putatively regulates was generated.
ggplot(TX_PROM, aes(reorder(`Targeting_Factor`, -Number_targets), Number_targets)) +
geom_point(shape = 21, size = 3, colour = "black", fill = "gray") +
theme_pubr() +
labs(x = "Gene Name", y = "MiRNA abundance") +
scale_color_manual(values=c("#E31A1C")) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
geom_point(data = subset(TX_PROM, Targeting_Factor == "NCL"),
aes(x=reorder(`Targeting_Factor`, -Number_targets), y=Number_targets),
shape = 21, size = 3, colour = "black", fill = "#08519C")
Because NCL was predicted to regulate the most number of miRs, we analyzed the species conservation of the NCL binding site across all miRs. NCL can both directly and indirectly regulate miRs (5-6). Whule “GU” sequences are a measure of indirect binding, “UCCCGA” sequences are a measure of direct binding (5-6). Because we were interested in the miRs NCL may directly regulate, we identified the species conservation of the “UCCCGA” binding sequence. This process was conducted in two steps. The first step involved ranking the miRs by species abundance of the “UCCCGA” binding site.
miR <- fread("./results/1 Raw immature miRNA hits.xls")
NCL_miR <- miR[miR$Targeting_Factor == "NCL" & miR$Consensus_Sequence == "UCCCGA",]
TX_SPEC <- TFRankR(DT = NCL_miR, sortBy = "species", dec = TRUE)
TX_SPEC
## Targeting_Factor Number_Species gene_symbol
## 1: NCL 19 MIR-93
## 2: NCL 12 MIR-484
## 3: NCL 9 MIR-206
## 4: NCL 7 MIR-887
## 5: NCL 5 MIR-598
## ---
## 260: NCL 1 MIR-H7
## 261: NCL 1 MIR-M59-2
## 262: NCL 1 MIR-M7
## 263: NCL 1 MIR-RO6-1
## 264: NCL 1 MIR-RR1-4
In the second step, the ranking results were graphed.
maxct <- max(TX_SPEC$Number_Species)
howmany <- 30
ggplot(TX_SPEC[1:howmany],
aes(y=Number_Species, x=reorder(gene_symbol,Number_Species),
ymin=0, ymax=Number_Species), col="blue") +
geom_point(size=2.5, color="#E7B800") +
geom_linerange(size=1, color="#00AFBB") +
theme(plot.title = element_text(size = 10,colour="black")) +
coord_flip() +
theme_pubr()+#
theme(axis.text=element_text(size=rel(1)),axis.title=element_text(size=rel(1.25))) +
geom_text(aes(label = paste(Number_Species,sep=":::"), y = Number_Species + .05*maxct), size = 4) +
ggtitle("NCL binding site conservation\n") +
ylab("Number of Species") +
xlab("pre-miR\n")
In this section, the goal is to identify targets of miR-93 and miR-484 in endothelial cells responding to shear stress. To accomplish this, first the seed sequence of miR-93 and miR-484 were used to determine which mRNAs contain a reverse complement miR-targeting SEED sequence (10). This was conducted for human, mouse, and rat species. Only targeting sequences conserved between human, mouse, and rat were retained for further analysis. Next, we downloaded and analyzed RNAseq datasets from endothelial cells overexpressing miR-93 or miR-484. We then compared these overexpression datasets to OS/PS datasets to identify transcripts downregulated by either miR-93 or miR-484 overexpression and OS/PS. Finally, of the co-downregulated transcripts, we identified which ones contain a reverse complement SEED sequence target.
The first step in the miRNA prediction process is to obtain a data table containing all mRNA sequences for all species available on the ensembl database (http://uswest.ensembl.org/info/data/ftp/index.html). Once the files have been downloaded and saved into a separate file, the seqCompile function reads all sequences and compiles them into a data table containing the mRNA sequence, the ensempl transcript ID, the name of the species file, as well as the scientific and common name or the species the sequence is found in.
transcriptome <- seqCompile(1:length(dir()), type = "mRNA")
transcriptome[,2:5]
## ensembl_transcript_id Species_File Scientific_Name Common_Name
## 1: ENST00000632684 Homo_sapiens.GRCh38.cdna.all.fa.gz Homo_sapiens Human
## 2: ENST00000434970 Homo_sapiens.GRCh38.cdna.all.fa.gz Homo_sapiens Human
## 3: ENST00000448914 Homo_sapiens.GRCh38.cdna.all.fa.gz Homo_sapiens Human
## 4: ENST00000415118 Homo_sapiens.GRCh38.cdna.all.fa.gz Homo_sapiens Human
## 5: ENST00000631435 Homo_sapiens.GRCh38.cdna.all.fa.gz Homo_sapiens Human
## ---
## 334607: ENSRNOT00000090245 Rattus_norvegicus.Rnor_6.0.cdna.all.fa.gz Rattus_norvegicus Rat
## 334608: ENSRNOT00000093304 Rattus_norvegicus.Rnor_6.0.cdna.all.fa.gz Rattus_norvegicus Rat
## 334609: ENSRNOT00000092502 Rattus_norvegicus.Rnor_6.0.cdna.all.fa.gz Rattus_norvegicus Rat
## 334610: ENSRNOT00000089207 Rattus_norvegicus.Rnor_6.0.cdna.all.fa.gz Rattus_norvegicus Rat
## 334611: ENSRNOT00000086520 Rattus_norvegicus.Rnor_6.0.cdna.all.fa.gz Rattus_norvegicus Rat
Next, the annotation information can be downloaded. this is accomplished for Human, mouse, and rat species by running A biomaRt query in a loop.
COMPANNOT <- NULL
Values <- transcriptome$ensembl_transcript_id
data.set <- c("hsapiens_gene_ensembl", "mmusculus_gene_ensembl", "rnorvegicus_gene_ensembl")
for(i in 1:length(data.set)){
mymart <- useMart("ensembl", dataset=data.set[i])
mRNAdata_annot <- getBM(attributes=c("ensembl_transcript_id", "external_gene_name",
"transcription_start_site", "transcript_start", "transcript_end",
"chromosome_name"), values = Values, mart = mymart)
COMPANNOT <- rbind(COMPANNOT, mRNAdata_annot)
}
COMPANNOT
## ensembl_transcript_id external_gene_name transcription_start_site chromosome_name
## 1: ENST00000585070 MIR4723 28360654 17
## 2: ENST00000459194 RF00019 61642383 9
## 3: ENST00000620386 RF02116 88992370 10
## 4: ENST00000618314 MIR6085 62343029 15
## 5: ENST00000410938 RNU6-1328P 94495299 7
## ---
## 408354: ENSRNOT00000076510 Eda2r 66602458 X
## 408355: ENSRNOT00000017429 Eda2r 66602506 X
## 408356: ENSRNOT00000067217 Ccdc88b 222118459 1
## 408357: ENSRNOT00000092790 Ccdc88b 222107007 1
## 408358: ENSRNOT00000092969 Ccdc88b 222118195 1
Finally, the transcriptome data table is annotated with the biomaRt information.
TRANS_CHR <- merge(transcriptome, COMPANNOT, by= "ensembl_transcript_id")
TRANS_CHR <- TRANS_CHR[!duplicated(TRANS_CHR$ensembl_transcript_id),]
TRANS_CHR <- TRANS_CHR[order(TRANS_CHR$Species),]
TRANS_CHR$external_gene_name <- toupper(TRANS_CHR$external_gene_name)
TRANS_CHR[,c(1,4:6, 10)]
## ensembl_transcript_id Scientific_Name Common_Name external_gene_name chromosome_name
## 1: ENST00000000233 Homo_sapiens Human ARF5 7
## 2: ENST00000000412 Homo_sapiens Human M6PR 12
## 3: ENST00000000442 Homo_sapiens Human ESRRA 11
## 4: ENST00000001008 Homo_sapiens Human FKBP4 12
## 5: ENST00000001146 Homo_sapiens Human CYP26B1 2
## ---
## 334607: ENSRNOT00000093750 Rattus_norvegicus Rat FUBP3 3
## 334608: ENSRNOT00000093751 Rattus_norvegicus Rat GRIN3B 7
## 334609: ENSRNOT00000093752 Rattus_norvegicus Rat MPL 5
## 334610: ENSRNOT00000093753 Rattus_norvegicus Rat WBP4 15
## 334611: ENSRNOT00000093754 Rattus_norvegicus Rat LOC100912373 5
Next, the mRNA data table was cleaned by eliminating all sequences belonging to extraneous chromosomes.
data.table(TRANS_CHR[!duplicated(TRANS_CHR$chromosome_name),]$chromosome_name)
## V1
## 1: 7
## 2: 12
## 3: 11
## 4: 2
## 5: 6
## ---
## 583: KL568013.1
## 584: KL568410.1
## 585: AABR07024206.1
## 586: AABR07022620.1
## 587: KL568122.1
CHR_Labels <- c(1:21, "X", "Y")
TRANS_CHR <- TRANS_CHR[(TRANS_CHR$chromosome_name %in% CHR_Labels),]
TRANS_CHR[,c(1,5,6,10), with = FALSE]
## ensembl_transcript_id Common_Name external_gene_name chromosome_name
## 1: ENST00000000233 Human ARF5 7
## 2: ENST00000000412 Human M6PR 12
## 3: ENST00000000442 Human ESRRA 11
## 4: ENST00000001008 Human FKBP4 12
## 5: ENST00000001146 Human CYP26B1 2
## ---
## 311121: ENSRNOT00000093750 Rat FUBP3 3
## 311122: ENSRNOT00000093751 Rat GRIN3B 7
## 311123: ENSRNOT00000093752 Rat MPL 5
## 311124: ENSRNOT00000093753 Rat WBP4 15
## 311125: ENSRNOT00000093754 Rat LOC100912373 5
data.table(TRANS_CHR[!duplicated(TRANS_CHR$chromosome_name),]$chromosome_name)
## V1
## 1: 7
## 2: 12
## 3: 11
## 4: 2
## 5: 6
## 6: 16
## 7: 4
## 8: 3
## 9: 1
## 10: 8
## 11: 19
## 12: 17
## 13: 5
## 14: 14
## 15: X
## 16: 10
## 17: 18
## 18: 20
## 19: 13
## 20: 15
## 21: Y
## 22: 9
## 23: 21
## V1
Finally, we eliminated all splice variants and retained only the variants with the longest sequence.
PROMTrans <- VariantSort(TRANS_CHR, variant = "MAX")
PROMTrans[,c(1, 5, 6, 10:11), with = FALSE]
## ensembl_transcript_id Common_Name external_gene_name chromosome_name Length
## 1: ENST00000463733 Human ARF5 7 1509
## 2: ENST00000000412 Human M6PR 12 2756
## 3: ENST00000405666 Human ESRRA 11 2283
## 4: ENST00000001008 Human FKBP4 12 3732
## 5: ENST00000001146 Human CYP26B1 2 4732
## ---
## 93707: ENSRNOT00000093717 Rat AABR07059381.1 4 175
## 93708: ENSRNOT00000093724 Rat AABR07040760.1 X 414
## 93709: ENSRNOT00000093732 Rat AABR07003509.1 1 1974
## 93710: ENSRNOT00000093742 Rat AC125565.1 2 1005
## 93711: ENSRNOT00000093749 Rat LOC681392 X 1744
Now that the mRNA data table is formatted, we need to format the miRNA data table to include the species the miRNA is from as well as the miRNA targeting information.
MiRNADT[,2:ncol(MiRNADT)]
## miRNA_Name miRNA_type Common_Name Scientific_Name miRNA
## 1: CEL-LET-7 IMMATURE_HAIR_PIN roundworm Caenorhabditis_elegans LET-7
## 2: CEL-LIN-4 IMMATURE_HAIR_PIN roundworm Caenorhabditis_elegans LIN-LIN
## 3: CEL-MIR-1 IMMATURE_HAIR_PIN roundworm Caenorhabditis_elegans MIR-1
## 4: CEL-MIR-2 IMMATURE_HAIR_PIN roundworm Caenorhabditis_elegans MIR-2
## 5: CEL-MIR-34 IMMATURE_HAIR_PIN roundworm Caenorhabditis_elegans MIR-34
## ---
## 87470: SMC-MIR-12461-3P MATURE NA NA MIR-12461-3P
## 87471: GGA-MIR-1784B-5P MATURE Chicken Gallus_gallus MIR-1784B-5P
## 87472: GGA-MIR-1784B-3P MATURE Chicken Gallus_gallus MIR-1784B-3P
## 87473: MDO-MIR-7385G-3P MATURE Gray short-tailed opossum Monodelphis_domestica MIR-7385G-3P
## 87474: MDO-MIR-7385G-2-3P MATURE Gray short-tailed opossum Monodelphis_domestica MIR-7385G-2-3P
First, we subsetted the data table to return miRNAs of interest.
MiDT <- MiRNADT[MiRNADT$miRNA == "MIR-93-3P" | MiRNADT$miRNA == "MIR-93-5P" | MiRNADT$miRNA == "MIR-484",]
MiDT <- MiDT[MiDT$Common_Name == "Human" | MiDT$Common_Name == "Mouse" | MiDT$Common_Name == "Rat",]
MiDT <- MiDT[MiDT$miRNA_type == "MATURE",]
MiDT[,2:ncol(MiDT)]
## miRNA_Name miRNA_type Common_Name Scientific_Name miRNA
## 1: HSA-MIR-93-5P MATURE Human Homo_sapiens MIR-93-5P
## 2: HSA-MIR-93-3P MATURE Human Homo_sapiens MIR-93-3P
## 3: MMU-MIR-93-5P MATURE Mouse Mus_musculus MIR-93-5P
## 4: MMU-MIR-93-3P MATURE Mouse Mus_musculus MIR-93-3P
## 5: RNO-MIR-93-5P MATURE Rat Rattus_norvegicus MIR-93-5P
## 6: RNO-MIR-93-3P MATURE Rat Rattus_norvegicus MIR-93-3P
## 7: HSA-MIR-484 MATURE Human Homo_sapiens MIR-484
## 8: MMU-MIR-484 MATURE Mouse Mus_musculus MIR-484
## 9: RNO-MIR-484 MATURE Rat Rattus_norvegicus MIR-484
Next, we determined the miRNA seed sequence (positions 2-8 from the 5’ end), that must be perfectly complimentary for miRNA-mRNA binding (10).
MiDT <- MISeed(MiDT); MiDT[,2:ncol(MiDT)]
## miRNA_Name miRNA_type Common_Name Scientific_Name miRNA seed_Sequence
## 1: HSA-MIR-93-5P MATURE Human Homo_sapiens MIR-93-5P AAAGUGC
## 2: HSA-MIR-93-3P MATURE Human Homo_sapiens MIR-93-3P CUGCUGA
## 3: MMU-MIR-93-5P MATURE Mouse Mus_musculus MIR-93-5P AAAGUGC
## 4: MMU-MIR-93-3P MATURE Mouse Mus_musculus MIR-93-3P CUGCUGA
## 5: RNO-MIR-93-5P MATURE Rat Rattus_norvegicus MIR-93-5P AAAGUGC
## 6: RNO-MIR-93-3P MATURE Rat Rattus_norvegicus MIR-93-3P CUGCUGA
## 7: HSA-MIR-484 MATURE Human Homo_sapiens MIR-484 CAGGCUC
## 8: MMU-MIR-484 MATURE Mouse Mus_musculus MIR-484 CAGGCUC
## 9: RNO-MIR-484 MATURE Rat Rattus_norvegicus MIR-484 CAGGCUC
Finally, we obtained the query sequence (the reverse complement of the SEED sequence).
MiDT <- MIQuerySeq(MiDT, wobble = TRUE); MiDT[,2:ncol(MiDT)]
## miRNA_Name miRNA_type Common_Name Scientific_Name miRNA seed_Sequence Query_Sequence
## 1: HSA-MIR-93-5P MATURE Human Homo_sapiens MIR-93-5P AAAGUGC G(C|T)(A|G)(C|T)TTT
## 2: HSA-MIR-93-3P MATURE Human Homo_sapiens MIR-93-3P CUGCUGA T(C|T)(A|G)G(C|T)(A|G)G
## 3: MMU-MIR-93-5P MATURE Mouse Mus_musculus MIR-93-5P AAAGUGC G(C|T)(A|G)(C|T)TTT
## 4: MMU-MIR-93-3P MATURE Mouse Mus_musculus MIR-93-3P CUGCUGA T(C|T)(A|G)G(C|T)(A|G)G
## 5: RNO-MIR-93-5P MATURE Rat Rattus_norvegicus MIR-93-5P AAAGUGC G(C|T)(A|G)(C|T)TTT
## 6: RNO-MIR-93-3P MATURE Rat Rattus_norvegicus MIR-93-3P CUGCUGA T(C|T)(A|G)G(C|T)(A|G)G
## 7: HSA-MIR-484 MATURE Human Homo_sapiens MIR-484 CAGGCUC G(A|G)G(C|T)(C|T)TG
## 8: MMU-MIR-484 MATURE Mouse Mus_musculus MIR-484 CAGGCUC G(A|G)G(C|T)(C|T)TG
## 9: RNO-MIR-484 MATURE Rat Rattus_norvegicus MIR-484 CAGGCUC G(A|G)G(C|T)(C|T)TG
In this section, we queried miRNA’s against mRNA sequences in human, mouse, and rat. The function MIRNATargetpredict queries the miR query sequence against all mRNA sequences. Both the miRNA query sequence and the mRNA sequences are paired for each species. It then returns a data table with paired matches.
setnames(PROMTrans, colnames(PROMTrans),
c("ensembl_transcript_id", "Sequence", "Species_File", "Scientific_Name", "Species",
"external_gene_name", "transcription_start_site", "transcript_start", "transcript_end",
"chromosome_name", "Length"))
setnames(MiDT, c("Sequence", "miRNA_Name", "miRNA_type", "Common_Name", "Scientific_Name",
"seed_Sequence", "Query_Sequence"),
c("Sequence", "miRNA_Name", "miRNA_type", "Species", "Scientific_Name",
"seed_Sequence", "Query_Sequence"))
Spec <- c("Homo_sapiens", "Mus_musculus", "Rattus_norvegicus")
MiRNAhitstot <- MIRNATargetpredict(MiRNADT = MiDT, mRNADT = PROMTrans, type = "Multiple", Spe = Spec)
MiRNAhitstot[, c(1, 3:7, 9:10), with = FALSE]
## Query_Sequence start end mRNA_Name Number_Hits length Species miRNA
## 1: G(A|G)G(C|T)(C|T)TG 1289 1295 ARF5 1 1509 Homo_sapiens MIR-484
## 2: G(C|T)(A|G)(C|T)TTT 1307 1313 M6PR 3 2756 Homo_sapiens MIR-93-5P
## 3: G(C|T)(A|G)(C|T)TTT 2203 2209 M6PR 3 2756 Homo_sapiens MIR-93-5P
## 4: G(C|T)(A|G)(C|T)TTT 2685 2691 M6PR 3 2756 Homo_sapiens MIR-93-5P
## 5: T(C|T)(A|G)G(C|T)(A|G)G 569 575 M6PR 1 2756 Homo_sapiens MIR-93-3P
## ---
## 497491: T(C|T)(A|G)G(C|T)(A|G)G 916 922 AABR07003509.1 2 1974 Rattus_norvegicus MIR-93-3P
## 497492: G(A|G)G(C|T)(C|T)TG 171 177 AABR07003509.1 1 1974 Rattus_norvegicus MIR-484
## 497493: G(C|T)(A|G)(C|T)TTT 960 966 LOC681392 2 1744 Rattus_norvegicus MIR-93-5P
## 497494: G(C|T)(A|G)(C|T)TTT 1096 1102 LOC681392 2 1744 Rattus_norvegicus MIR-93-5P
## 497495: G(A|G)G(C|T)(C|T)TG 1644 1650 LOC681392 1 1744 Rattus_norvegicus MIR-484
Next, the species designation were removed from the miRNA name and add this new name lacking the species information to another column in the new dataset. This new label was used to compute predicted miR-mRNA species preservation.
MiRNAhitstot <- MiRNAname(MiRNAhitstot); MiRNAhitstot[, c(1,3:7, 9:10), with = FALSE]
## Query_Sequence start end mRNA_Name Number_Hits length Species miRNA
## 1: G(A|G)G(C|T)(C|T)TG 1289 1295 ARF5 1 1509 Homo_sapiens MIR-484
## 2: G(C|T)(A|G)(C|T)TTT 1307 1313 M6PR 3 2756 Homo_sapiens MIR-93-5P
## 3: G(C|T)(A|G)(C|T)TTT 2203 2209 M6PR 3 2756 Homo_sapiens MIR-93-5P
## 4: G(C|T)(A|G)(C|T)TTT 2685 2691 M6PR 3 2756 Homo_sapiens MIR-93-5P
## 5: T(C|T)(A|G)G(C|T)(A|G)G 569 575 M6PR 1 2756 Homo_sapiens MIR-93-3P
## ---
## 497491: T(C|T)(A|G)G(C|T)(A|G)G 916 922 AABR07003509.1 2 1974 Rattus_norvegicus MIR-93-3P
## 497492: G(A|G)G(C|T)(C|T)TG 171 177 AABR07003509.1 1 1974 Rattus_norvegicus MIR-484
## 497493: G(C|T)(A|G)(C|T)TTT 960 966 LOC681392 2 1744 Rattus_norvegicus MIR-93-5P
## 497494: G(C|T)(A|G)(C|T)TTT 1096 1102 LOC681392 2 1744 Rattus_norvegicus MIR-93-5P
## 497495: G(A|G)G(C|T)(C|T)TG 1644 1650 LOC681392 1 1744 Rattus_norvegicus MIR-484
Next, miRNA-mRNA associations conservation between human, mouse, and rat were retained for each mRNA.
setnames(MiRNAhitstot, colnames(MiRNAhitstot),
c("Query_Sequence", "start", "end", "miRNA_Name", "gene_symbol", "Number_Hits",
"length", "Sequence", "Species", "Targeting_Factor"))
MiRNAhitstot <- SpeciesTFCons(DT = MiRNAhitstot,Spec = c("Homo_sapiens", "Mus_musculus",
"Rattus_norvegicus"), provide = "TF_Target")
MiRNAhitstot[, c(1,3:7, 9:10), with = FALSE]
## Query_Sequence start end gene_symbol Number_Hits length Species Targeting_Factor
## 1: G(A|G)G(C|T)(C|T)TG 1289 1295 ARF5 1 1509 Homo_sapiens MIR-484
## 2: G(C|T)(A|G)(C|T)TTT 1307 1313 M6PR 3 2756 Homo_sapiens MIR-93-5P
## 3: G(C|T)(A|G)(C|T)TTT 2203 2209 M6PR 3 2756 Homo_sapiens MIR-93-5P
## 4: G(C|T)(A|G)(C|T)TTT 2685 2691 M6PR 3 2756 Homo_sapiens MIR-93-5P
## 5: G(A|G)G(C|T)(C|T)TG 412 418 ESRRA 4 2283 Homo_sapiens MIR-484
## ---
## 308526: G(A|G)G(C|T)(C|T)TG 1710 1716 IL1RAPL2 5 3264 Rattus_norvegicus MIR-484
## 308527: G(A|G)G(C|T)(C|T)TG 1956 1962 IL1RAPL2 5 3264 Rattus_norvegicus MIR-484
## 308528: G(A|G)G(C|T)(C|T)TG 1967 1973 IL1RAPL2 5 3264 Rattus_norvegicus MIR-484
## 308529: G(C|T)(A|G)(C|T)TTT 902 908 POU5F2 1 1235 Rattus_norvegicus MIR-93-5P
## 308530: G(A|G)G(C|T)(C|T)TG 1108 1114 POU5F2 1 1235 Rattus_norvegicus MIR-484
The number of predicted interactions before and after minimum species filtering is as follows:
## [1] 497495 8
## [1] 308530 8
In this section, RNAseq data generated from miR-93 overexpression in human umbilical vein endothelial cells (HUVEC) available online through gene expression omnibus was analyzed. The data is freely available at the following link: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE86497. Because this is an RNAseq dataset, we downloaded the FASTq files, aligned them to the human GRCh38 reference genome, and computed fold changes relative to control. This is a multistep process and is outlined below. The first step in this process is to download the human genome and corresponding gtf file:
URL <- "ftp://ftp.ensembl.org/pub/release-85/gtf/homo_sapiens/Homo_sapiens.GRCh38.85.gtf.gz"
download.file(URL, destfile = "./data/GTF.gtf.gz")
URLGEN <- "ftp://ftp.ensembl.org/pub/release-85/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"
download.file(URLGEN, destfile = "./data/genome.fa.gz")
After downloading the files, the file were opened and the numer of A, T, C, or G nucleotides were counted to get a sense of the genomic sequences. The frequencies of the nucleotides and the percentage of “N” not called nucleotides can be computed as follows:
myseq <- readDNAStringSet("./data/genome.fasta")
freq <- alphabetFrequency(myseq)[,c(1:4, 15)]
freq <- data.table(freq)
per <- NULL
for(i in 1:nrow(freq)){
per[i] <- freq[i,5, with = FALSE]/(freq[i,1, with = FALSE]+
freq[i,2, with = FALSE]+freq[i,3, with = FALSE]+
freq[i,4, with = FALSE]+freq[i,5, with = FALSE])
}
per <- unlist(per)
freq$percentN <- per
freq$name <- names(myseq)
freq[1:22,1:6]
## A C G T N percentN
## 1: 67070277 48055043 48111528 67244164 18475410 0.0742114216
## 2: 38875926 27639505 27719976 39027555 534460 0.0039945463
## 3: 39286730 27903257 27981801 39361954 552880 0.0040927813
## 4: 39370109 27092804 27182678 39492225 137493 0.0010316465
## 5: 30047611 18839192 18933605 30162717 16381203 0.1432369978
## 6: 26673415 18423758 18559033 26911943 16475569 0.1539143941
## 7: 24508669 17752941 17825903 24553812 17349864 0.1701114005
## 8: 22558319 18172742 18299976 22774906 8532402 0.0944493947
## 9: 22639499 18723944 18851500 22705261 337237 0.0040505329
## 10: 24050680 15794455 16061651 24182819 283680 0.0035295310
## 11: 15142293 13954580 14061132 15282753 176858 0.0030171476
## 12: 71791213 48318180 48450903 71987932 1645301 0.0067933318
## 13: 17867246 13916133 14094472 18066406 499910 0.0077572575
## 14: 11820664 8185244 8226381 11856330 6621364 0.1417547936
## 15: 10382214 9160652 9246186 10370725 11658691 0.2294183878
## 16: 59689091 39233483 39344259 59833302 195424 0.0009855188
## 17: 58561236 36236976 36331025 58623430 461888 0.0024282474
## 18: 54699094 35731600 35879674 54955010 272881 0.0015031597
## 19: 51345477 33646690 33713330 51373025 727457 0.0042589668
## 20: 47058248 32317984 32378859 47215040 375842 0.0023586539
## 21: 43333530 29030173 29103787 43300646 370500 0.0025527317
## 22: 35736329 25099811 25170662 35783748 16604167 0.1199768847
## A C G T N percentN
Next, to automate this process, I used the systempipeR bioconductor package. This package initially requires the user to set up a text file indicating which files to align, which aligner and parameters to use, and what samples to calculate the fold changes from. This information is indicated below.
targets <- read.delim("targets.txt", comment.char ="#")
targets[,1:4]
## FileName SampleName Factor SampleLong
## 1 ./data/SRX2143034_SRR4183476_Mock_rep1.fastq.gz MOCK1 MOCK MOCK_1
## 2 ./data/SRX2143035_SRR4183477_Mock_rep2.fastq.gz MOCK2 MOCK MOCK_2
## 3 ./data/SRX2143036_SRR4183478_Mimic_control_rep1.fastq.gz CTRLMIMIC1 MIMICCTRL MIMICCTRL_1
## 4 ./data/SRX2143037_SRR4183479_Mimic_control_rep2.fastq.gz CTRLMIMIC2 MIMICCTRL MIMICCTRL_2
## 5 ./data/SRX2143038_SRR4183480_Mimic_rep1.fastq.gz MIMIC1 MIMIC MIMIC1
## 6 ./data/SRX2143039_SRR4183481_Mimic_rep2.fastq.gz MIMIC2 MIMIC MIMIC2
## 7 ./data/SRX2143040_SRR4183482_Mimic_rep3.fastq.gz MIMIC3 MIMIC MIMIC3
## 8 ./data/SRX2143041_SRR4183483_CTRL_inhibitor_rep1.fastq.gz CTRLINHIB1 CTRLINHIB CTRLINHIB_1
## 9 ./data/SRX2143042_SRR4183484_CTRL_inhibitor_rep2.fastq.gz CTRLINHIB2 CTRLINHIB CTRLINHIB_2
## 10 ./data/SRX2143043_SRR4183485_miR484_inhibitor_rep1.fastq.gz INHIB1 MIR484INHIB INHIB_1
## 11 ./data/SRX2143044_SRR4183486_miR484_inhibitor_rep2.fastq.gz INHIB2 MIR484INHIB INHIB_2
## 12 ./data/SRX2143045_SRR4183487_miR484_inhibitor_rep3.fastq.gz INHIB3 MIR484INHIB INHIB_3
## 13 ./data/SRX2143046_SRR4183488_no_tx_rep1.fastq.gz NOTX1 NO_TX NOTX_1
## 14 ./data/SRX2143047_SRR4183489_no_tx_rep2.fastq.gz NOTX2 NO_TX NOTX_2
Next, I loaded the Hisat2.param file
parampath <- system.file("extdata", "hisat2.param", package="systemPipeR")
read.delim(parampath, comment.char = "#")
## PairSet Name Value
## 1 modules <NA> hisat2/2.0.1
## 2 software <NA> hisat2
## 3 cores -p 7
## 4 other <NA> -k 1 --min-intronlen 30 --max-intronlen 3000
## 5 outfile1 -S <FileName1>
## 6 outfile1 path ./results/
## 7 outfile1 remove <NA>
## 8 outfile1 append .hisat.sam
## 9 outfile1 outextension .hisat.bam
## 10 reference <NA> ./data/Homo_sapiens.GRCh38.dna.chromosome.1.fa
## 11 infile1 -U <FileName1>
## 12 infile1 path <NA>
## 13 infile2 <NA> <FileName2>
## 14 infile2 path <NA>
The following information indicates the alignment parameters that were used.
args <- systemArgs(sysma=parampath, mytargets="targets.txt")
args
## An instance of 'SYSargs' for running 'hisat2' on 14 samples
modules(args)
## [1] "hisat2/2.0.1"
names(args)
## [1] "targetsin" "targetsout" "targetsheader" "modules" "software" "cores" "other" "reference" "results" "infile1" "infile2" "outfile1" "sysargs" "outpaths"
sysargs(args)[1]
## MOCK1
## "hisat2 -p 7 -k 1 --min-intronlen 30 --max-intronlen 3000 -S C:/Users/breng/Dropbox/Brendan Documents/Research/Research project/3 NCL manuscript/NCL final programming notes/results/SRX2143034_SRR4183476_Mock_rep1.fastq.gz.hisat.sam C:/Users/breng/Dropbox/Brendan Documents/Research/Research project/3 NCL manuscript/NCL final programming notes/data/Homo_sapiens.GRCh38.dna.chromosome.1.fa -U C:\\Users\\breng\\Dropbox\\Brendan Documents\\Research\\Research project\\3 NCL manuscript\\NCL final programming notes\\data\\SRX2143034_SRR4183476_Mock_rep1.fastq.gz "
Next, the reference genome was indexed.
system("hisat2-build ./data/genome.fasta ./data/genome.fasta")
In this section of code, the reads were aligned to the human genome (GRCh38) with hisat2.
bampaths <- runCommandline(args=args)
(read_statsDF <- alignStats(args=args))
## FileName Nreads Nalign Perc_Aligned Nalign_Primary Perc_Aligned_Primary
## 1: MOCK1 21759864 13689527 62.91182 13689527 62.91182
## 2: MOCK2 6904259 3013524 43.64732 3013524 43.64732
## 3: CTRLMIMIC1 14856688 5879712 39.57620 5879712 39.57620
## 4: CTRLMIMIC2 14197084 8190945 57.69456 8190945 57.69456
## 5: MIMIC1 20007656 8703905 43.50287 8703905 43.50287
## 6: MIMIC2 21783252 8890267 40.81240 8890267 40.81240
## 7: MIMIC3 21201661 11391097 53.72738 11391097 53.72738
## 8: CTRLINHIB1 22913477 17917546 78.19654 17917546 78.19654
## 9: CTRLINHIB2 16970400 13574453 79.98900 13574453 79.98900
## 10: INHIB1 24332799 18899658 77.67153 18899658 77.67153
## 11: INHIB2 19486865 10326784 52.99356 10326784 52.99356
## 12: INHIB3 19487579 15189465 77.94434 15189465 77.94434
## 13: NOTX1 21987954 13610564 61.90009 13610564 61.90009
## 14: NOTX2 21812337 15880264 72.80405 15880264 72.80405
Once the reads were aligned to the genome, we determined which reads belong to gene bodies. To do this, we counted reads per Feature and Stored the Annotations in TranscriptDb.
txdb <- makeTxDbFromGFF(file="data/GTF.gtf.gz",
format = "gtf",
dataSource= "ensembl",
organism ="Homo sapiens")
saveDb(txdb, file="./data/GTF.sqlite")
txdb <- loadDb("./data/GTF.sqlite")
eByg <- exonsBy(txdb, by="gene")
Using the TranscriptDb information, we computed the raw number of reads per gene with summarizeOverlaps.
txdb <- loadDb("./data/GTF.sqlite")
eByg <- exonsBy(txdb, by=c("gene"))
bfl <- BamFileList(outpaths(args), yieldSize=50000, index=character())
multicoreParam <- MulticoreParam(workers=7); register(multicoreParam)
counteByg <- bplapply(bfl, function(x) summarizeOverlaps(eByg, x, mode="Union", ignore.strand=TRUE,
inter.feature=TRUE, singleEnd=TRUE))
countDFeByg <- sapply(seq(along=counteByg), function(x) assays(counteByg[[x]])$counts)
rownames(countDFeByg) <- names(rowRanges(counteByg[[1]])); colnames(countDFeByg) <- names(bfl)
rpkmDFeByg <- apply(countDFeByg, 2, function(x) returnRPKM(counts=x, ranges=eByg))
print read counts
counteByg[,c(1,2,4,6,9,11,14)]
## V1 MOCK1 CTRLMIMIC1 MIMIC1 CTRLINHIB1 INHIB1 NOTX1
## 1: ENSG00000000003 363 134 271 457 555 497
## 2: ENSG00000000005 0 0 0 0 0 0
## 3: ENSG00000000419 451 220 285 614 679 415
## 4: ENSG00000000457 93 34 44 107 127 72
## 5: ENSG00000000460 152 32 43 166 147 129
## ---
## 58298: ENSG00000284744 0 0 0 0 0 0
## 58299: ENSG00000284745 0 0 0 0 0 0
## 58300: ENSG00000284746 0 0 0 0 0 0
## 58301: ENSG00000284747 2 1 0 5 5 1
## 58302: ENSG00000284748 0 0 0 0 0 0
print RPKM normalized read counts
rpkmDFeByg[,c(1,2,4,6,9,11,14)]
## V1 MOCK1 CTRLMIMIC1 MIMIC1 CTRLINHIB1 INHIB1 NOTX1
## 1: ENSG00000000003 6.82658351 5.85178810 8.0976434 6.56985939 7.72815299 9.55858900
## 2: ENSG00000000005 0.00000000 0.00000000 0.0000000 0.00000000 0.00000000 0.00000000
## 3: ENSG00000000419 31.86715887 36.09744764 31.9966299 33.16486753 35.52405416 29.98855345
## 4: ENSG00000000457 1.15233633 0.97827788 0.8662466 1.01349865 1.16516108 0.91236668
## 5: ENSG00000000460 2.17250920 1.06207461 0.9765153 1.81371572 1.55568375 1.88559476
## ---
## 58298: ENSG00000284744 0.00000000 0.00000000 0.0000000 0.00000000 0.00000000 0.00000000
## 58299: ENSG00000284745 0.00000000 0.00000000 0.0000000 0.00000000 0.00000000 0.00000000
## 58300: ENSG00000284746 0.00000000 0.00000000 0.0000000 0.00000000 0.00000000 0.00000000
## 58301: ENSG00000284747 0.04144085 0.04811558 0.0000000 0.07919756 0.07671047 0.02119041
## 58302: ENSG00000284748 0.00000000 0.00000000 0.0000000 0.00000000 0.00000000 0.00000000
Next, we conducted differential expressed gene analysis using edgeR.
countDF <- rpkmDFeByg
cmp <- readComp(args, format="matrix", delim="-")
edgeDF <- run_edgeR(countDF=countDF, targets=targetsin(args), cmp=cmp[[1]], independent=FALSE, mdsplot="")
## V1 MIMIC-MIR484INHIB_logFC MIMIC-MIR484INHIB_PValue MIMIC-MIR484INHIB_FDR
## 1: ENSG00000000003 0.30474373 0.4877807 1
## 2: ENSG00000000005 NA NA NA
## 3: ENSG00000000419 0.08857267 0.7247513 1
## 4: ENSG00000000457 -0.26894805 0.8080476 1
## 5: ENSG00000000460 -0.17241237 0.8612456 1
## ---
## 58298: ENSG00000284744 NA NA NA
## 58299: ENSG00000284745 NA NA NA
## 58300: ENSG00000284746 NA NA NA
## 58301: ENSG00000284747 NA NA NA
## 58302: ENSG00000284748 NA NA NA
Finally, to annotate the gene names we downloaded the annotation information from biomart:
Values <- row.names(edgeDF)
listMarts() # To choose BioMart database
# myMart <- useMart("ENSEMBL_MART_ENSEMBL"); listDatasets(myMart)
myMart <- useMart("ENSEMBL_MART_ENSEMBL", dataset="hsapiens_gene_ensembl")
# listAttributes(myMart)[1:100,] # Choose data types you want to download
go <- getBM(attributes=c("ensembl_gene_id", "ensembl_transcript_id", "external_gene_name",
"wikigene_name", "wikigene_description", "transcription_start_site",
"strand", "go_id", "definition_1006"), mart=myMart, values = Values)
go <- go[go[,8]!="",]; go[,8] <- as.character(go[,8])
go[,c(1,4,8)]
## ensembl_gene_id wikigene_name go_id
## 1: ENSG00000198888 ND1 GO:0016020
## 2: ENSG00000198888 ND1 GO:0016021
## 3: ENSG00000198888 ND1 GO:0016491
## 4: ENSG00000198888 ND1 GO:0055114
## 5: ENSG00000198888 ND1 GO:0005743
## ---
## 1043212: ENSG00000283951 KIR3DL2 GO:0050776
## 1043213: ENSG00000283951 KIR3DL2 GO:0005887
## 1043214: ENSG00000283951 KIR3DL2 GO:0006968
## 1043215: ENSG00000283882 GO:0016020
## 1043216: ENSG00000283882 GO:0016021
Once biomart annotations were downloaded, we merged this information with the computed fold changes to obtain a final miR-93 overexpression data set.
DEdata <- fread("./data/7 edgeRcomp.xls")
setnames(DEdata, c(colnames(DEdata)),
c("ensembl_gene_id",colnames(DEdata)[2:length(DEdata)]))
setkey(DEdata,ensembl_gene_id)
head(DEdata)
BioGO <- fread("./data/GO/GO BioMart annotations.xls")
BioGO <- BioGO[,c(1:2,4:9), with = FALSE]
DEdata1 <- merge(DEdata, BioGO, by="ensembl_gene_id")
DEdata1 <- DEdata1[!DEdata1$wikigene_name =="",]
DEdata1 <- DEdata1[!grepl("Rik", wikigene_name)]
DEdata1 <- DEdata1[complete.cases(DEdata1),]
DEdata1[,c(1:2, 5, 8, 12)]
## ensembl_gene_id MIMIC-MIR484INHIB_logFC MIMIC-MIR484INHIB_PValue wikigene_name go_id
## 1: ENSG00000000003 0.3047437 0.4877807 TSPAN6 GO:0016020
## 2: ENSG00000000003 0.3047437 0.4877807 TSPAN6 GO:0016021
## 3: ENSG00000000003 0.3047437 0.4877807 TSPAN6 GO:0004871
## 4: ENSG00000000003 0.3047437 0.4877807 TSPAN6 GO:0005887
## 5: ENSG00000000003 0.3047437 0.4877807 TSPAN6 GO:0005515
## ---
## 639893: ENSG00000284308 -0.2009451 0.7769564 C2orf81
## 639894: ENSG00000284585 -2.6470819 0.1071606 MIR4722
## 639895: ENSG00000284693 -0.9663936 0.5291754 LOC100506022
## 639896: ENSG00000284693 -0.9663936 0.5291754 LOC100506022
## 639897: ENSG00000284693 -0.9663936 0.5291754 LOC100506022
Then, we merged the data together with the raw read counts.
rawdata <- fread("./data/5 countDFeByg.xls")
setnames(rawdata, c(colnames(rawdata)),
c("ensembl_gene_id",colnames(rawdata)[2:length(rawdata)]))
rawmerge <- merge(DEdata, rawdata, by="ensembl_gene_id")
rawmerge <- rawmerge[,c("ensembl_gene_id", "ensembl_transcript_id", "wikigene_name",
"wikigene_description", colnames(rawmerge)[!(colnames(rawmerge) %in%
c("ensembl_gene_id", "ensembl_transcript_id", "wikigene_name",
"wikigene_description", "transcription_start_site", "strand",
"go_id", "definition_1006"))] ), with = FALSE]
Example print out of the first columns of data table:
rawmerge[,c(1, 3, 5, 8)]
## ensembl_gene_id wikigene_name MIMIC-MIR484INHIB_logFC MIMIC-MIR484INHIB_PValue
## 1: ENSG00000000003 TSPAN6 0.30474373 0.4877807
## 2: ENSG00000000419 DPM1 0.08857267 0.7247513
## 3: ENSG00000000457 SCYL3 -0.26894805 0.8080476
## 4: ENSG00000000460 C1orf112 -0.17241237 0.8612456
## 5: ENSG00000000938 FGR 1.32518418 0.3540597
## ---
## 12779: ENSG00000284032 MIR29A 0.23317201 0.8497746
## 12780: ENSG00000284154 MIR3605 -1.43389604 0.3596510
## 12781: ENSG00000284308 C2orf81 -0.20094508 0.7769564
## 12782: ENSG00000284585 MIR4722 -2.64708194 0.1071606
## 12783: ENSG00000284693 LOC100506022 -0.96639357 0.5291754
Eample print out of the last columns of data table:
rawmerge[,c(10:23)]
## MOCK1 CTRLMIMIC1 MIMIC1 CTRLINHIB1 INHIB1 NOTX1
## 1: 363 134 271 457 555 497
## 2: 451 220 285 614 679 415
## 3: 93 34 44 107 127 72
## 4: 152 32 43 166 147 129
## 5: 9 5 14 5 7 5
## ---
## 12779: 0 1 0 0 0 0
## 12780: 0 0 0 1 1 0
## 12781: 162 51 52 218 132 114
## 12782: 0 0 0 0 0 1
## 12783: 13 9 2 19 20 26
In this section, we obtained GEO data of miR-484 overexpression in HUVECs. This data was gnerated using an affymetrix microarray and can be downloaded using the GEO2R workflow. The data is available here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE66844 and the code to download it is illustrated below.
#### load series and platform data from GEO
gset <- getGEO("GSE66844", GSEMatrix =TRUE, AnnotGPL=FALSE)
if (length(gset) > 1) idx <- grep("GPL15207", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]
#### make proper column names to match toptable
fvarLabels(gset) <- make.names(fvarLabels(gset))
# group names for all samples
gsms <- "101010"
sml <- c()
for (i in 1:nchar(gsms)) { sml[i] <- substr(gsms,i,i) }
#### subset out desiredsamples (eliminate samples marked as "X") ####
sel <- which(sml != "X")
sml <- sml[sel]
gset <- gset[ ,sel]
#### log2 transform
ex <- exprs(gset)
#### Differential expression analysis with limma ####
qx <- as.numeric(quantile(ex, c(0., 0.25, 0.5, 0.75, 0.99, 1.0), na.rm=T))
LogC <- (qx[5] > 100) ||
(qx[6]-qx[1] > 50 && qx[2] > 0) ||
(qx[2] > 0 && qx[2] < 1 && qx[4] > 1 && qx[4] < 2)
if (LogC) { ex[which(ex <= 0)] <- NaN
exprs(gset) <- log2(ex) }
#### set up the data and proceed with analysis with limma ####
sml <- paste("G", sml, sep="") # set group names
fl <- as.factor(sml)
gset$description <- fl
design <- model.matrix(~ description + 0, gset)
colnames(design) <- levels(fl)
fit <- lmFit(gset, design)
cont.matrix <- makeContrasts(G1-G0, levels=design)
fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2, 0.01)
tT <- topTable(fit2, adjust="fdr", sort.by="B", number=15000000)
The data was then cleaned and transformed into a data table.
#### Subset data, remove entries with no gene name, and transform into a data.table ####
sub <- tT[,c(1:ncol(tT))]
head(sub)
sub <- as.data.table(sub)
sub2 <- sub[,c(1:ncol(sub)), with = FALSE]
sub2 <- sub2[!sub2$Gene.Symbol == "",]
sub2 <- sub2[!sub2$Gene.Symbol == "---",]
sub2[,c(1,15,18,25,28)]
## ID Gene.Symbol Ensembl logFC P.Value
## 1: 11760322_at ZSCAN18 --- -4.714000e-01 0.005471052
## 2: 11750788_a_at CSNK1G1 ENSG00000169118 4.928433e-01 0.008587143
## 3: 11739326_a_at PTPN2 ENSG00000175354 4.904500e-01 0.008630547
## 4: 11759792_at SH3YL1 ENSG00000035115 -4.000500e-01 0.009816660
## 5: 11756563_a_at RCBTB2 ENSG00000136161 -5.084000e-01 0.009973088
## ---
## 48803: 11750654_s_at ITPR1 ENSG00000150995 3.333333e-05 0.999976212
## 48804: 11729898_at C5orf54 ENSG00000221886 -6.666667e-06 0.999984979
## 48805: 11719501_a_at NFATC3 ENSG00000072736 6.666667e-06 0.999985909
## 48806: 11757538_a_at GOT1 ENSG00000120053 -1.000000e-05 0.999987493
## 48807: 11756174_s_at SCD ENSG00000099194 0.000000e+00 1.000000000
In this section, we will combine the in silico miR-93 and miR-484 target prediction with OS/PS and miR-93 or miR-484 RNAseq data. To accomplish this, we will first annotate the OS/PS RNAseq data with predicted targets of miR-93, miR-484, or both miR-93 and miR-484. This information will then be compared to miR-93 or miR-484 overexpression data in the following sections. first miR-target keys were created:
mirtarg <- MiRNAhitstot #fread("./results/27 miRNA target hits HMR conserved multiple species.xls")
mir935p <- mirtarg[mirtarg$Targeting_Factor == "MIR-93-5P",]
mir935p <- unique(mir935p[mir935p$Species == "Homo_sapiens",][,c(5,10), with = FALSE])
mir933p <- mirtarg[mirtarg$Targeting_Factor == "MIR-93-3P",]
mir933p <- unique(mir933p[mir933p$Species == "Homo_sapiens",][,c(5,10), with = FALSE])
mir484 <- mirtarg[mirtarg$Targeting_Factor == "MIR-484",]
mir484 <- unique(mir484[mir484$Species == "Homo_sapiens",][,c(5,10), with = FALSE])
key3p <- mir933p$gene_symbol; key5p <- mir935p$gene_symbol
key484 <- mir484$gene_symbol
p3 <- setdiff(key3p, key5p); p3 <- setdiff(p3, key484)
p5 <- setdiff(key5p, key3p); p5 <- setdiff(p5, key484)
m484 <- setdiff(key484, key5p); m484 <- setdiff(key484, key3p)
p3p5 <- intersect(key3p, key5p); p3p5 <- setdiff(p3p5, key484)
p3m484 <- intersect(key3p, key484); p3m484 <- setdiff(p3m484, key5p)
p5m484<- intersect(key5p, key484); p5m484 <- setdiff(p5m484, key3p)
p3p5m484 <- intersect(key5p, key484); p5m484 <- intersect(p5m484, key3p)
next, these keys were used to annotate the PS/OS RNAseq dataset.
Load the PSOS RNAseq dataset
PSOS <- fread("./data/5 PSOS FC data.xls")
PSOS <- PSOS[!duplicated(PSOS$GeneName),][,c(2,21,22)]
setnames(PSOS, colnames(PSOS), c("Gene.symbol", "PS24-OS24_logFC", "PS24-OS24_PValue"))
DT_fin <- PSOS
DT_fin
## Gene.symbol PS24-OS24_logFC PS24-OS24_PValue
## 1: A2M 0.209494340 0.9153554
## 2: ACADM 0.085649521 0.8633036
## 3: ACADS -0.043946288 0.9498366
## 4: ACADVL 0.368829343 0.4124650
## 5: ACAT1 0.197154902 0.4887729
## ---
## 13116: EEF1E1-MUTED -0.006299207 0.9868662
## 13117: LOC645513 0.049879873 0.9547314
## 13118: LOC100288911 -0.010454722 0.9821938
## 13119: LOC100527964 0.362486676 0.7770170
## 13120: TMX2-CTNND1 0.383143195 0.7912074
Finally, we annotated the PSOS RNAseq dataset with miR-93 and miR-484 predicted targets.
intgene <- DT_fin[DT_fin$`PS24-OS24_PValue` < 0.99,]
p3gene <- intgene[(intgene$Gene.symbol %in% p3),]; p3gene$classification <- "miR-93-3p"
p5gene <- intgene[(intgene$Gene.symbol %in% p5),]; p5gene$classification <- "miR-93-5p"
m484gene <- intgene[(intgene$Gene.symbol %in% m484),]; m484gene$classification <- "miR-484"
p3p5gene <- intgene[(intgene$Gene.symbol %in% p3p5),]; p3p5gene$classification <- "miR-93-3p & miR-93-5p"
p3m484gene <- intgene[(intgene$Gene.symbol %in% p3m484),]; p3m484gene$classification <- "miR-93-3p & miR-484"
p5m484gene <- intgene[(intgene$Gene.symbol %in% p5m484),]; p5m484gene$classification <- "miR-93-5p & miR-484"
p3p5m484gene <- intgene[(intgene$Gene.symbol %in% p3p5m484),]
p3p5m484gene$classification <- "miR-93-3p & miR-93-5p & miR-484"
targs <- rbind(p3gene, p5gene, m484gene, p3p5gene, p3m484gene, p5m484gene, p3p5m484gene)
DT_not <- DT_fin[!(DT_fin$Gene.symbol %in% targs$Gene.symbol),];DT_not$classification <- "Not targeted"
Final <- rbind(targs, DT_not)
Final
## Gene.symbol PS24-OS24_logFC PS24-OS24_PValue classification
## 1: PSEN1 0.024171231 0.97050643 miR-93-3p
## 2: AGA -0.214242712 0.70493347 miR-93-3p
## 3: BCKDHB -0.535026027 0.59976335 miR-93-3p
## 4: CDK4 -0.531227924 0.01165379 miR-93-3p
## 5: FAH -0.503621322 0.23975900 miR-93-3p
## ---
## 13648: EEF1E1-MUTED -0.006299207 0.98686616 Not targeted
## 13649: LOC645513 0.049879873 0.95473142 Not targeted
## 13650: LOC100288911 -0.010454722 0.98219385 Not targeted
## 13651: LOC100527964 0.362486676 0.77701702 Not targeted
## 13652: TMX2-CTNND1 0.383143195 0.79120740 Not targeted
Once the PS/OS datasets were annotated with the predicted miR-93 and miR-484 targets, they were merged with the miR-484 overexpression data.
First we subsetted both the miR-484 fold change data and shear data table (labeled PSOS in the code) to include only targets identified in both data tables. Then the two data tables were merged together.
array484 <- sub2[, c(15, 25, 28)]
array484 <- array484[,.SD[which.min(logFC)],by=list(Gene.Symbol)]
array484 <- array484[(array484$Gene.Symbol %in% Final$Gene.symbol),]
setnames(array484, colnames(array484), c("Gene.symbol", "logFC", "P.Value"))
setnames(PSOS, colnames(PSOS), c("Gene.symbol", "PS24-OS24_logFC", "PS24-OS24_PValue"))
finDT <- merge(PSOS, array484, by = "Gene.symbol")
finDT$`PS24-OS24_logFC` <- finDT$`PS24-OS24_logFC` * -1
setnames(finDT, colnames(finDT), c("Gene.symbol", "OS24-PS24_logFC", "OS24-PS24_PValue", "logFC", "P.Value"))
finDT
## Gene.symbol OS24-PS24_logFC OS24-PS24_PValue logFC P.Value
## 1: A1BG -0.043704994 0.97806237 -0.234666667 0.2455293
## 2: A2LD1 -0.006677226 0.99360186 0.210973333 0.7649909
## 3: A2M -0.209494340 0.91535537 -0.235710000 0.6905679
## 4: A4GALT -0.939078919 0.03600162 -0.067770000 0.8731722
## 5: AAAS 0.055997313 0.95732985 -0.010286667 0.9287602
## ---
## 12035: ZYG11A 0.803083617 0.67979591 -0.240126667 0.2449006
## 12036: ZYG11B -0.221037393 0.73607048 -0.089140000 0.7005601
## 12037: ZYX 0.270695732 0.09659020 -0.160966667 0.4223581
## 12038: ZZEF1 -0.143581918 0.71563471 -0.008093333 0.9428464
## 12039: ZZZ3 0.112509648 0.82455565 -0.220350000 0.3292576
Next, we annotated the resulting data table according to if they were predicted targets of miR-484 or both miR-93 and miR-484.
finDT$target <- "NULL"
targs484 <- Final[Final$classification == "miR-484" |
Final$classification == "miR-93-3p & miR-484" |
Final$classification == "miR-93-5p & miR-484" |
Final$classification == "miR-93-3p & miR-93-5p & miR-484",]
finDT[(finDT$Gene.symbol %in% targs484$Gene.symbol),]$target <- "consensus match"
fintarget <- finDT[finDT$`OS24-PS24_logFC` < 0 & finDT$logFC < 0,]
fintarget$classification <- "miR484 target"
notarget <- finDT[!(finDT$`OS24-PS24_logFC` < 0 & finDT$logFC < 0),]
notarget$classification <- "no target"
finDT <- rbind(fintarget, notarget)
finDT2 <- finDT[!duplicated(finDT$Gene.symbol),]
finDT2
## Gene.symbol OS24-PS24_logFC OS24-PS24_PValue logFC P.Value target classification
## 1: A1BG -0.04370499 0.97806237 -0.23466667 0.2455293 consensus match miR484 target
## 2: A2M -0.20949434 0.91535537 -0.23571000 0.6905679 consensus match miR484 target
## 3: A4GALT -0.93907892 0.03600162 -0.06777000 0.8731722 NULL miR484 target
## 4: AARS -0.00616604 0.98052221 -0.01750667 0.9276419 consensus match miR484 target
## 5: AATK -0.09304133 0.95853392 -0.03175000 0.9519064 consensus match miR484 target
## ---
## 12035: ZWINT 0.07040942 0.95754742 -0.10663333 0.4989971 NULL no target
## 12036: ZXDB 0.03332474 0.96848205 0.18894333 0.7786269 NULL no target
## 12037: ZYG11A 0.80308362 0.67979591 -0.24012667 0.2449006 consensus match no target
## 12038: ZYX 0.27069573 0.09659020 -0.16096667 0.4223581 consensus match no target
## 12039: ZZZ3 0.11250965 0.82455565 -0.22035000 0.3292576 NULL no target
Finally, we generated a plot to visualize results. The point highlighted in red is eNOS.
ggplot(finDT2, aes(`OS24-PS24_logFC`, logFC)) + #, colour = group
geom_point(shape = 21, size = 3, colour = "black", fill = "gray"
) +
theme_pubr() +
labs(x = "OS/PS log2(FC)", y = "miR-484 overexpression log2(FC)") +
scale_color_manual(values=c("#E31A1C")) +
geom_point(data = subset(finDT, classification == "miR484 target"),
aes(x=`OS24-PS24_logFC`, y=logFC, colour = classification),
shape = 21, size = 3, colour = "black", fill = "#08519C") +
geom_point(data = subset(finDT, Gene.symbol == "NOS3"),
aes(x=`OS24-PS24_logFC`, y=logFC), shape = 21, size = 3, colour = "black", fill = "#E31A1C")
Once the PS/OS datasets were annotated with the predicted miR-93 and miR-484 targets, they were merged with the miR-93 overexpression data.
First we subsetted both the miR-93 fold change data and shear data table (labeled PSOS in the code) to include only targets identified in both data tables. Then the two data tables were merged together.
miR93 <- rawmerge[,c(3, 5, 8)]
setnames(miR93, colnames(miR93), c("Gene.symbol", "logFC", "P.Value"))
setnames(PSOS, colnames(PSOS), c("Gene.symbol", "PS24-OS24_logFC", "PS24-OS24_PValue"))
finDT <- merge(PSOS, miR93, by = "Gene.symbol")
finDT$`PS24-OS24_logFC` <- finDT$`PS24-OS24_logFC` * -1
setnames(finDT, colnames(finDT), c("Gene.symbol", "OS24-PS24_logFC", "OS24-PS24_PValue", "logFC", "P.Value"))
finDT
## Gene.symbol OS24-PS24_logFC OS24-PS24_PValue logFC P.Value
## 1: A4GALT -0.93907892 0.03600162 -0.18787262 0.8128323
## 2: AAAS 0.05599731 0.95732985 -0.02209726 0.9551142
## 3: AACS 0.04103272 0.93041781 -0.13603877 0.8438917
## 4: AAGAB 0.16755632 0.63033882 -0.04423256 0.9062805
## 5: AAK1 0.07969348 0.89380589 -0.05550181 0.9178856
## ---
## 10449: ZYG11A 0.80308362 0.67979591 -0.22762969 0.8642075
## 10450: ZYG11B -0.22103739 0.73607048 -0.13578960 0.7038547
## 10451: ZYX 0.27069573 0.09659020 0.27357950 0.3256346
## 10452: ZZEF1 -0.14358192 0.71563471 -0.25029822 0.6948852
## 10453: ZZZ3 0.11250965 0.82455565 0.13418417 0.7326937
Next, we annotated the resulting data table according to if they were predicted targets of miR-93 or both miR-93 and miR-484.
finDT$target <- "NULL"
p35targs <- Final[Final$classification == "miR-93-5p" |
Final$classification == "miR-93-3p" |
Final$classification == "miR-93-3p & miR-93-5p" |
Final$classification == "miR-93-3p & miR-484" |
Final$classification == "miR-93-5p & miR-484" |
Final$classification == "miR-93-3p & miR-93-5p & miR-484",]
finDT[(finDT$Gene.symbol %in% p35targs$Gene.symbol),]$target <- "consensus match"
fintarget <- finDT[finDT$`OS24-PS24_logFC` < 0 & finDT$logFC < 0,]
fintarget$classification <- "miR93 target"
notarget <- finDT[!(finDT$`OS24-PS24_logFC` < 0 & finDT$logFC < 0),]
notarget$classification <- "no target"
finDT <- rbind(fintarget, notarget)
finDT2 <- finDT[!duplicated(finDT$Gene.symbol),]
finDT2
## Gene.symbol OS24-PS24_logFC OS24-PS24_PValue logFC P.Value target classification
## 1: A4GALT -0.93907892 0.03600162 -0.18787262 0.81283228 NULL miR93 target
## 2: AARS -0.00616604 0.98052221 -0.07430367 0.77795499 consensus match miR93 target
## 3: ABCA1 -0.66064342 0.71612309 -0.16097283 0.91175318 consensus match miR93 target
## 4: ABCA2 -1.21810660 0.06897123 -1.06257427 0.13471243 consensus match miR93 target
## 5: ABCA3 -0.69067000 0.37125665 -0.64628215 0.05635528 consensus match miR93 target
## ---
## 10440: ZXDA -0.07751632 0.95563444 0.18839355 0.85671216 NULL no target
## 10441: ZXDB 0.03332474 0.96848205 0.01381112 0.98091095 NULL no target
## 10442: ZYG11A 0.80308362 0.67979591 -0.22762969 0.86420745 consensus match no target
## 10443: ZYX 0.27069573 0.09659020 0.27357950 0.32563462 consensus match no target
## 10444: ZZZ3 0.11250965 0.82455565 0.13418417 0.73269368 NULL no target
Finally, we generated a plot to visualize results. The points highlighted in red are eNOS and KLF2.
ggplot(finDT2, aes(`OS24-PS24_logFC`, logFC)) + #, colour = group
geom_point(shape = 21, size = 3, colour = "black", fill = "gray") +
theme_pubr() +
labs(x = "OS/PS log2(FC)", y = "miR-484 overexpression log2(FC)") +
scale_color_manual(values=c("#E31A1C")) +
geom_point(data = subset(finDT, classification == "miR93 target"),
aes(x=`OS24-PS24_logFC`, y=logFC, colour = classification),
shape = 21, size = 3, colour = "black", fill = "#08519C") +
geom_point(data = subset(finDT, Gene.symbol == "KLF2"),
aes(x=`OS24-PS24_logFC`, y=logFC),
shape = 21, size = 3, colour = "black", fill = "#E31A1C") +
geom_point(data = subset(finDT, Gene.symbol == "NOS3"),
aes(x=`OS24-PS24_logFC`, y=logFC),
shape = 21, size = 3, colour = "black", fill = "#E31A1C")
In this section, we mined the predicted NCL regulated miRs and expression profiles of miRs under PS or OS and to find miRs that are both differentially regulated by shear stress in the endothelium and may require NCL for maturation. This was done in two steps. In the first step, miR expression profiles were downloaded from Gene Expression Omnibus. In the second step, predicted NCL regulated miRs were annotated with the expression profiles and only miRs that were significantly differentially expressed were retained.
This portion of code queries and downloads Gene Expression Omnibibus for miRs that are differentially regulated by shear stress from the GEO accession number GSE26953. The results of this code are illustrated below.
# load series and platform data from GEO
gset <- getGEO("GSE26953", GSEMatrix =TRUE, AnnotGPL=FALSE)
if (length(gset) > 1) idx <- grep("GPL8179", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]
# make proper column names to match toptable
fvarLabels(gset) <- make.names(fvarLabels(gset))
# group names for all samples
gsms <- "000000XXXXXX111111XXXXXX"
sml <- c()
for (i in 1:nchar(gsms)) { sml[i] <- substr(gsms,i,i) }
# eliminate samples marked as "X"
sel <- which(sml != "X")
sml <- sml[sel]
gset <- gset[ ,sel]
# log2 transform
ex <- exprs(gset)
qx <- as.numeric(quantile(ex, c(0., 0.25, 0.5, 0.75, 0.99, 1.0), na.rm=T))
LogC <- (qx[5] > 100) ||
(qx[6]-qx[1] > 50 && qx[2] > 0) ||
(qx[2] > 0 && qx[2] < 1 && qx[4] > 1 && qx[4] < 2)
if (LogC) { ex[which(ex <= 0)] <- NaN
exprs(gset) <- log2(ex) }
# set up the data and proceed with analysis
sml <- paste("G", sml, sep="") # set group names
fl <- as.factor(sml)
gset$description <- fl
design <- model.matrix(~ description + 0, gset)
colnames(design) <- levels(fl)
fit <- lmFit(gset, design)
cont.matrix <- makeContrasts(G1-G0, levels=design)
fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2, 0.01)
tT <- topTable(fit2, adjust="fdr", sort.by="B", number=2500000000000)
FC <- subset(tT, select=c("ID","adj.P.Val","P.Value","t","B","logFC","miRNA_ID"))
FC
## ID adj.P.Val P.Value t B logFC miRNA_ID
## 1: ILMN_3168684 0.08693411 8.218184e-05 5.895432269 1.3595844 0.7083580037 hsa-miR-26a-1*
## 2: ILMN_3168858 0.08693411 2.004817e-04 -5.322242670 0.6875645 -0.8370094056 hsa-miR-23b*
## 3: ILMN_3167240 0.08693411 2.277750e-04 -5.242563976 0.5893538 -1.2645767136 hsa-miR-486-5p
## 4: ILMN_3168616 0.34886620 1.563259e-03 4.099644770 -0.9449975 0.9039965897 hsa-miR-92a-1*
## 5: ILMN_3168681 0.34886620 1.721573e-03 4.044872832 -1.0240262 0.4359113156 hsa-miR-106b*
## ---
## 1141: ILMN_3168591 0.99445704 9.909830e-01 0.011546184 -5.7536583 0.0027169817 hsa-miR-93*
## 1142: ILMN_3168528 0.99474274 9.921364e-01 0.010069117 -5.7536739 0.0006708245 hsa-miR-562
## 1143: ILMN_3168499 0.99527851 9.935400e-01 0.008271783 -5.7536900 0.0007504537 hsa-miR-518a-5p,hsa-miR-527
## 1144: ILMN_3168756 0.99541842 9.945491e-01 0.006979741 -5.7536997 0.0004231339 hsa-miR-144*
## 1145: ILMN_3168800 0.99719223 9.971922e-01 0.003595222 -5.7537172 0.0025484069 hsa-miR-379*
In this section, shear differentially regulated miR data and predicted NCL regulated miRs were used to determine which miRs that respond to shear may be regulated by NCL. In the first steps of this process, we cleaned the shear regulated miRs data table. This was done in two steps. First the miR names were cleaned.
FC <- fread("./data/36 miR fold changes.xls")
FC <- FC[,c("miRNA_ID","logFC", "P.Value", "adj.P.Val")]
#### reformat miR names
FC$miRNA_ID <- toupper(FC$miRNA_ID)
FC$miRNA_ID <- gsub("HSA-", "", FC$miRNA_ID, fixed = TRUE)
FC$miRNA_ID <- gsub("*", "", FC$miRNA_ID, fixed = TRUE)
FC$miRNA_ID <- gsub("-5P", "", FC$miRNA_ID, fixed = TRUE)
FC$miRNA_ID <- gsub("-3P", "", FC$miRNA_ID, fixed = TRUE)
FC <- FC[!FC$miRNA_ID == "",]
setnames(FC, colnames(FC), c("gene_symbol", "OS/PS_logFC", "P.Value", "adj.P.Val"))
FC
## gene_symbol OS/PS_logFC P.Value adj.P.Val
## 1: MIR-26A-1 0.7083580037 8.218184e-05 0.08693411
## 2: MIR-23B -0.8370094056 2.004817e-04 0.08693411
## 3: MIR-486 -1.2645767136 2.277750e-04 0.08693411
## 4: MIR-92A-1 0.9039965897 1.563259e-03 0.34886620
## 5: MIR-106B 0.4359113156 1.721573e-03 0.34886620
## ---
## 854: MIR-93 0.0027169817 9.909830e-01 0.99445704
## 855: MIR-562 0.0006708245 9.921364e-01 0.99474274
## 856: MIR-518A,MIR-527 0.0007504537 9.935400e-01 0.99527851
## 857: MIR-144 0.0004231339 9.945491e-01 0.99541842
## 858: MIR-379 0.0025484069 9.971922e-01 0.99719223
Then records annotated with two miR names were divided into their own records, each containing the miR name and corresponding information in the data table.
# adjust duplicated names
DT <- NULL
names <- FC$gene_symbol
for(i in 1:length(names)){
spl <- strsplit(names[i], split = ",")
bound <- cbind(data.table(spl[[1]]), FC[i,2:ncol(FC)])
setnames(bound, colnames(bound), c("gene_symbol", "OS/PS_logFC", "P.Value", "adj.P.Val"))
DT <- rbind(DT, bound)
}
FC <- DT
FC
## gene_symbol OS/PS_logFC P.Value adj.P.Val
## 1: MIR-26A-1 0.7083580037 8.218184e-05 0.08693411
## 2: MIR-23B -0.8370094056 2.004817e-04 0.08693411
## 3: MIR-486 -1.2645767136 2.277750e-04 0.08693411
## 4: MIR-92A-1 0.9039965897 1.563259e-03 0.34886620
## 5: MIR-106B 0.4359113156 1.721573e-03 0.34886620
## ---
## 874: MIR-562 0.0006708245 9.921364e-01 0.99474274
## 875: MIR-518A 0.0007504537 9.935400e-01 0.99527851
## 876: MIR-527 0.0007504537 9.935400e-01 0.99527851
## 877: MIR-144 0.0004231339 9.945491e-01 0.99541842
## 878: MIR-379 0.0025484069 9.971922e-01 0.99719223
After the miR fold change data table was cleaned, NCL was subsetted from the predicted miR binding protein data table.
NCL <- AMPK_targs_PSOS[AMPK_targs_PSOS$Species == "Human",c(1,6,10:12)]
NCL <- NCL[NCL$Targeting_Factor == "NCL",]
Finally, to determine which miRs that respond to shear NCL may regulate, miR fold changes were annotated with predicted NCL regulated miRs. In this new data table, miRs were selected with a p-value less than 0.05.
NCL_FC <- merge(FC, NCL, by = "gene_symbol", allow.cartesian = TRUE)
NCL_FC <- NCL_FC[!duplicated(NCL_FC$gene_symbol),]
NCL_FC <- NCL_FC[NCL_FC$P.Value < 0.05,]
NCL_FC
## gene_symbol OS/PS_logFC P.Value adj.P.Val Targeting_Factor Gene_ID Identified_sequence Species
## 1: LET-7E -0.6437136 0.0105411895 0.73652265 NCL ENSG00000115053 GguA Human
## 2: MIR-106B 0.4359113 0.0017215729 0.34886620 NCL ENSG00000115053 AguG Human
## 3: MIR-107 0.2875544 0.0447074587 0.97698082 NCL ENSG00000115053 AguG Human
## 4: MIR-133B -0.2877096 0.0210871293 0.89425048 NCL ENSG00000115053 GguC Human
## 5: MIR-155 0.7752894 0.0341110361 0.97642841 NCL ENSG00000115053 UguU Human
## 6: MIR-187 1.1581162 0.0018281198 0.34886620 NCL ENSG00000115053 GguC Human
## 7: MIR-23B -0.8370094 0.0002004817 0.08693411 NCL ENSG00000115053 GguG Human
## 8: MIR-25 0.4601918 0.0320489675 0.96825357 NCL ENSG00000115053 AguG Human
## 9: MIR-27B -0.6752616 0.0067557295 0.68732332 NCL ENSG00000115053 GguG Human
## 10: MIR-30A -0.2532950 0.0061748468 0.68732332 NCL ENSG00000115053 UguA Human
## 11: MIR-363 -0.2063610 0.0482834035 0.97698082 NCL ENSG00000115053 UguU Human
## 12: MIR-377 -0.1887817 0.0189578268 0.89425048 NCL ENSG00000115053 GguU Human
## 13: MIR-452 0.6215602 0.0187447464 0.89425048 NCL ENSG00000115053 UguU Human
## 14: MIR-505 -0.6893031 0.0276553744 0.96825357 NCL ENSG00000115053 AguG Human
## 15: MIR-582 0.5588665 0.0244112128 0.96382202 NCL ENSG00000115053 UguG Human
## 16: MIR-877 0.3985299 0.0067659924 0.68732332 NCL ENSG00000115053 guA Human
## 17: MIR-92A-1 0.9039966 0.0015632590 0.34886620 NCL ENSG00000115053 GguU Human
In this section, the serum levels of miR-93-5p, miR-93-3p, and miR-484 were determined in people with coronary artery disease (CAD) or healthy control individuals (CTRL). This process was done in several steps as putlined below.
cohort1 <- fread("./data/2019-1-12 combined taiwan data WSC 0113 update.csv")
cohort1 <- cohort1[,c(3:7)]
head(cohort1)
## mir39_CT mir93-3p_CT mir93-5p_CT mir484_CT group
## 1: 24.57745 33.88671 30.31115 31.77536 CTRL
## 2: 27.41441 34.92678 34.30077 32.35014 CTRL
## 3: 25.23070 33.75857 30.92392 32.05500 CTRL
## 4: 23.60611 31.35246 27.55444 29.50591 CTRL
## 5: 22.56708 31.24483 26.91515 30.20167 CTRL
## 6: 24.37048 34.29940 29.35328 31.85543 CTRL
cohort2 <- fread("./data/2018-12-12 xian data.csv")
setnames(cohort2, colnames(cohort2), c("Sample Name", "mir39_CT", "mir93-3p_CT", "mir93-5p_CT", "mir484_CT", "group"))
cohort2 <- cohort2[,c(2:6)]
head(cohort2)
## mir39_CT mir93-3p_CT mir93-5p_CT mir484_CT group
## 1: 21.10 33.76 31.63 28.21 CTRL
## 2: 20.97 35.19 31.59 27.93 CTRL
## 3: 20.12 34.55 31.38 28.13 CTRL
## 4: 19.75 34.76 31.68 28.94 CTRL
## 5: 21.14 33.40 31.45 28.04 CTRL
## 6: 20.51 32.58 29.72 26.87 CTRL
In order to “normalize” fold changes to the reference gene, we used the delta CT method.
cohort1$delta_ct_93_3p <- 2^(cohort1$mir39_CT - cohort1$`mir93-3p_CT`)
cohort1$delta_ct_93_5p <- 2^(cohort1$mir39_CT - cohort1$`mir93-5p_CT`)
cohort1$delta_ct_484 <- 2^(cohort1$mir39_CT - cohort1$mir484_CT)
cohort1 <- cohort1[,c(1:4, 6:8, 5)]
head(cohort1)
## mir39_CT mir93-3p_CT mir93-5p_CT mir484_CT delta_ct_93_3p delta_ct_93_5p delta_ct_484 group
## 1: 24.57745 33.88671 30.31115 31.77536 0.001576274 0.018792462 0.006811006 CTRL
## 2: 27.41441 34.92678 34.30077 32.35014 0.005477117 0.008452767 0.032673609 CTRL
## 3: 25.23070 33.75857 30.92392 32.05500 0.002709284 0.019327266 0.008824273 CTRL
## 4: 23.60611 31.35246 27.55444 29.50591 0.004657115 0.064779157 0.016748741 CTRL
## 5: 22.56708 31.24483 26.91515 30.20167 0.002441945 0.049102262 0.005032229 CTRL
## 6: 24.37048 34.29940 29.35328 31.85543 0.001025875 0.031624673 0.005582182 CTRL
cohort2$delta_ct_93_3p <- 2^(cohort2$mir39_CT - cohort2$`mir93-3p_CT`)
cohort2$delta_ct_93_5p <- 2^(cohort2$mir39_CT - cohort2$`mir93-5p_CT`)
cohort2$delta_ct_484 <- 2^(cohort2$mir39_CT - cohort2$mir484_CT)
cohort2 <- cohort2[,c(1:4, 6:8, 5)]
head(cohort2)
## mir39_CT mir93-3p_CT mir93-5p_CT mir484_CT delta_ct_93_3p delta_ct_93_5p delta_ct_484 group
## 1: 21.10 33.76 31.63 28.21 1.545113e-04 0.0006763230 0.007238969 CTRL
## 2: 20.97 35.19 31.59 27.93 5.240268e-05 0.0006354208 0.008032139 CTRL
## 3: 20.12 34.55 31.38 28.13 4.530406e-05 0.0004077578 0.003879268 CTRL
## 4: 19.75 34.76 31.68 28.94 3.030678e-05 0.0002562785 0.001712121 CTRL
## 5: 21.14 33.40 31.45 28.04 2.038789e-04 0.0007877361 0.008373230 CTRL
## 6: 20.51 32.58 29.72 26.87 2.325776e-04 0.0016885493 0.012174447 CTRL
Fold Changes were calculated based on the average value of the control sample for each miR. This was done in two steps. In the first step, the average delta CT value was calculated for each miR in the control samples. In the second step, the fold changes were computed based on this average value.
ctrl <- cohort1[cohort1$group == "CTRL",]
ctrl <- ctrl[complete.cases(ctrl)]
ave3p <- ave(ctrl$delta_ct_93_3p)[1]
ave5p <- ave(ctrl$delta_ct_93_5p)[1]
ave484 <- ave(ctrl$delta_ct_484)[1]
#### compute fold changes ####
cohort1$FC3p <- cohort1$delta_ct_93_3p/ave3p
cohort1$FC5p <- cohort1$delta_ct_93_5p/ave5p
cohort1$FC484 <- cohort1$delta_ct_484/ave484
cohort1 <- cohort1[,c(5:7, 9:11, 8)]
head(cohort1)
## delta_ct_93_3p delta_ct_93_5p delta_ct_484 FC3p FC5p FC484 group
## 1: 0.001576274 0.018792462 0.006811006 0.6458062 1.8639916 0.3888060 CTRL
## 2: 0.005477117 0.008452767 0.032673609 2.2439975 0.8384152 1.8651718 CTRL
## 3: 0.002709284 0.019327266 0.008824273 1.1100050 1.9170378 0.5037333 CTRL
## 4: 0.004657115 0.064779157 0.016748741 1.9080393 6.4253316 0.9561013 CTRL
## 5: 0.002441945 0.049102262 0.005032229 1.0004748 4.8703677 0.2872646 CTRL
## 6: 0.001025875 0.031624673 0.005582182 0.4203054 3.1367962 0.3186586 CTRL
ctrl <- cohort2[cohort2$group == "CTRL",]
ctrl <- ctrl[complete.cases(ctrl)]
ave3p <- ave(ctrl$delta_ct_93_3p)[1]
ave5p <- ave(ctrl$delta_ct_93_5p)[1]
ave484 <- ave(ctrl$delta_ct_484)[1]
#### compute fold changes ####
cohort2$FC3p <- cohort2$delta_ct_93_3p/ave3p
cohort2$FC5p <- cohort2$delta_ct_93_5p/ave5p
cohort2$FC484 <- cohort2$delta_ct_484/ave484
cohort2 <- cohort2[,c(5:7, 9:11, 8)]
head(cohort2)
## delta_ct_93_3p delta_ct_93_5p delta_ct_484 FC3p FC5p FC484 group
## 1: 1.545113e-04 0.0006763230 0.007238969 1.9620825 0.9071049 1.3854664 CTRL
## 2: 5.240268e-05 0.0006354208 0.008032139 0.6654424 0.8522456 1.5372713 CTRL
## 3: 4.530406e-05 0.0004077578 0.003879268 0.5752997 0.5468971 0.7424531 CTRL
## 4: 3.030678e-05 0.0002562785 0.001712121 0.3848547 0.3437285 0.3276828 CTRL
## 5: 2.038789e-04 0.0007877361 0.008373230 2.5889834 1.0565355 1.6025527 CTRL
## 6: 2.325776e-04 0.0016885493 0.012174447 2.9534183 2.2647334 2.3300675 CTRL
This data was not normally distributed. Therefore, we calculated the log2 of the fold changes to create a more “normal” distribution of the data.
cohort1$log2FC_93_3p <- log2(cohort1$FC3p)
cohort1$log2FC_93_5p <- log2(cohort1$FC5p)
cohort1$log2FC_484 <- log2(cohort1$FC484)
cohort1 <- cohort1[,c(1:6,8:10, 7)]
head(cohort1)
## delta_ct_93_3p delta_ct_93_5p delta_ct_484 FC3p FC5p FC484 log2FC_93_3p log2FC_93_5p log2FC_484 group
## 1: 0.001576274 0.018792462 0.006811006 0.6458062 1.8639916 0.3888060 -0.6308268124 0.8983953 -1.36287756 CTRL
## 2: 0.005477117 0.008452767 0.032673609 2.2439975 0.8384152 1.8651718 1.1660710776 -0.2542631 0.89930849 CTRL
## 3: 0.002709284 0.019327266 0.008824273 1.1100050 1.9170378 0.5037333 0.1505662376 0.9388788 -0.98926802 CTRL
## 4: 0.004657115 0.064779157 0.016748741 1.9080393 6.4253316 0.9561013 0.9320909076 2.6837709 -0.06476469 CTRL
## 5: 0.002441945 0.049102262 0.005032229 1.0004748 4.8703677 0.2872646 0.0006848776 2.2840307 -1.79954787 CTRL
## 6: 0.001025875 0.031624673 0.005582182 0.4203054 3.1367962 0.3186586 -1.2504900424 1.6492918 -1.64991636 CTRL
cohort2$log2FC_93_3p <- log2(cohort2$FC3p)
cohort2$log2FC_93_5p <- log2(cohort2$FC5p)
cohort2$log2FC_484 <- log2(cohort2$FC484)
cohort2 <- cohort2[,c(1:6,8:10, 7)]
head(cohort2)
## delta_ct_93_3p delta_ct_93_5p delta_ct_484 FC3p FC5p FC484 log2FC_93_3p log2FC_93_5p log2FC_484 group
## 1: 1.545113e-04 0.0006763230 0.007238969 1.9620825 0.9071049 1.3854664 0.9723857 -0.14065877 0.4703718 CTRL
## 2: 5.240268e-05 0.0006354208 0.008032139 0.6654424 0.8522456 1.5372713 -0.5876143 -0.23065877 0.6203718 CTRL
## 3: 4.530406e-05 0.0004077578 0.003879268 0.5752997 0.5468971 0.7424531 -0.7976143 -0.87065877 -0.4296282 CTRL
## 4: 3.030678e-05 0.0002562785 0.001712121 0.3848547 0.3437285 0.3276828 -1.3776143 -1.54065877 -1.6096282 CTRL
## 5: 2.038789e-04 0.0007877361 0.008373230 2.5889834 1.0565355 1.6025527 1.3723857 0.07934123 0.6803718 CTRL
## 6: 2.325776e-04 0.0016885493 0.012174447 2.9534183 2.2647334 2.3300675 1.5623857 1.17934123 1.2203718 CTRL
Graphs for cohort 1:
DT2 <- cohort1[,c(7:10)]
DT_melt <- melt(DT2, id.vars = c("group"),measure.vars = c("log2FC_93_3p", "log2FC_93_5p", "log2FC_484"))
DT_melt <- DT_melt[DT_melt$value < 100,]
p1 <- ggplot(DT_melt, aes(x= `variable`, y= `value`, fill = group)) +
geom_dotplot(binaxis='y', stackdir='center',
position=position_dodge(0.8), binwidth = 0.2) +
ylab("Fold change") +
xlab("microRNA") +
ggtitle("cohort1") +
theme_pubr()
p2 <- ggplot(DT_melt, aes(x= `variable`, y= `value`, fill = group)) +
geom_boxplot(position=position_dodge(0.8)) +
ylab("Fold change") +
xlab("microRNA") +
ggtitle("cohort1") +
theme_pubr()
p1;p2
Graphs for cohort 2:
DT2 <- cohort2[,c(7:10)]
DT_melt <- melt(DT2, id.vars = c("group"),measure.vars = c("log2FC_93_3p", "log2FC_93_5p", "log2FC_484"))
DT_melt <- DT_melt[DT_melt$value < 100,]
p1 <- ggplot(DT_melt, aes(x= `variable`, y= `value`, fill = group)) +
geom_dotplot(binaxis='y', stackdir='center',
position=position_dodge(0.8), binwidth = 0.2) +
ylab("Fold change") +
xlab("microRNA") +
ggtitle("cohort2") +
theme_pubr()
p2 <- ggplot(DT_melt, aes(x= `variable`, y= `value`, fill = group)) +
geom_boxplot(position=position_dodge(0.8)) +
ylab("Fold change") +
xlab("microRNA") +
ggtitle("cohort2") +
theme_pubr()
p1;p2
In this section, we compared the control to the treatment groupd for each miR. Because the data was not normally distributed, we used Kruskal-Wallis test followed by a Mann-Whitney U non-parametric tests.
Stats for cohort 1:
CTRL <- cohort1[cohort1$group == "CTRL",]
CAD <- cohort1[cohort1$group == "CAD",]
wilcox.test(CTRL$log2FC_93_3p ,CAD$log2FC_93_3p, conf.int = TRUE,
paired = FALSE, formula = "lhs")
##
## Wilcoxon rank sum test
##
## data: CTRL$log2FC_93_3p and CAD$log2FC_93_3p
## W = 933, p-value = 0.4489
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
## -1.3456787 0.5814148
## sample estimates:
## difference in location
## -0.3927225
wilcox.test(CTRL$log2FC_93_5p ,CAD$log2FC_93_5p, conf.int = TRUE,
paired = FALSE, formula = "lhs")
##
## Wilcoxon rank sum test
##
## data: CTRL$log2FC_93_5p and CAD$log2FC_93_5p
## W = 591, p-value = 0.0002325
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
## -3.1296677 -0.9367714
## sample estimates:
## difference in location
## -2.134306
wilcox.test(CTRL$log2FC_484 ,CAD$log2FC_484, conf.int = TRUE,
paired = FALSE, formula = "lhs")
##
## Wilcoxon rank sum test
##
## data: CTRL$log2FC_484 and CAD$log2FC_484
## W = 561, p-value = 8.39e-05
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
## -3.671774 -1.307429
## sample estimates:
## difference in location
## -2.697787
stats for cohort 2:
CTRL <- cohort2[cohort2$group == "CTRL",]
CAD <- cohort2[cohort2$group == "CAD",]
wilcox.test(CTRL$log2FC_93_3p ,CAD$log2FC_93_3p, conf.int = TRUE,
paired = FALSE, formula = "lhs")
##
## Wilcoxon rank sum test
##
## data: CTRL$log2FC_93_3p and CAD$log2FC_93_3p
## W = 296, p-value = 0.005525
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
## -1.87 -0.32
## sample estimates:
## difference in location
## -1.12
wilcox.test(CTRL$log2FC_93_5p ,CAD$log2FC_93_5p, conf.int = TRUE,
paired = FALSE, formula = "lhs")
##
## Wilcoxon rank sum test with continuity correction
##
## data: CTRL$log2FC_93_5p and CAD$log2FC_93_5p
## W = 140, p-value = 1.021e-06
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
## -2.50006 -1.17003
## sample estimates:
## difference in location
## -1.914086
wilcox.test(CTRL$log2FC_484 ,CAD$log2FC_484, conf.int = TRUE,
paired = FALSE, formula = "lhs")
##
## Wilcoxon rank sum test with continuity correction
##
## data: CTRL$log2FC_484 and CAD$log2FC_484
## W = 186.5, p-value = 2.155e-05
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
## -2.0199425 -0.8300589
## sample estimates:
## difference in location
## -1.376852
cohort1$dataset <- "cohort1"
cohort2$dataset <- "cohort2"
combined <- rbind(cohort1, cohort2)
combined2 <- combined[,7:11]
combined_comp <- combined2[complete.cases(combined2),]
head(combined[,c(7:11)])
## log2FC_93_3p log2FC_93_5p log2FC_484 group dataset
## 1: -0.6308268124 0.8983953 -1.36287756 CTRL cohort1
## 2: 1.1660710776 -0.2542631 0.89930849 CTRL cohort1
## 3: 0.1505662376 0.9388788 -0.98926802 CTRL cohort1
## 4: 0.9320909076 2.6837709 -0.06476469 CTRL cohort1
## 5: 0.0006848776 2.2840307 -1.79954787 CTRL cohort1
## 6: -1.2504900424 1.6492918 -1.64991636 CTRL cohort1
p <- ggplot(combined, aes(x=log2FC_93_3p, y=log2FC_93_5p))+
geom_point(shape = 21, size = 3, colour = "black", fill = "#08519C")+
scale_color_manual(values=c("#E31A1C")) +
geom_point(data = subset(combined, combined$dataset == "cohort2"),
aes(x=log2FC_93_3p, y=log2FC_93_5p),
shape = 21, size = 3, colour = "black", fill = "#E31A1C") +
geom_smooth(method = "lm")+
ggtitle("miR-93-5p miR-93-3p correlation")+
theme_pubr(); p
cor.test(combined_comp$log2FC_93_3p, combined_comp$log2FC_93_5p, method=c("pearson"))
##
## Pearson's product-moment correlation
##
## data: combined_comp$log2FC_93_3p and combined_comp$log2FC_93_5p
## t = 11.198, df = 152, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5753789 0.7506865
## sample estimates:
## cor
## 0.6723538
p <- ggplot(data = combined, mapping = aes(x=combined$log2FC_93_3p, y=combined$log2FC_484))+
geom_point(shape = 21, size = 3, colour = "black", fill = "#08519C")+
geom_smooth(method = "lm")+
geom_point(data = subset(combined, combined$dataset == "cohort2"),
aes(x=log2FC_93_3p, y=log2FC_484),
shape = 21, size = 3, colour = "black", fill = "#E31A1C") +
ggtitle("miR-484 miR-93-3p correlation")+
theme_pubr(); p
cor.test(combined_comp$log2FC_93_3p, combined_comp$log2FC_484, method=c("pearson"))
##
## Pearson's product-moment correlation
##
## data: combined_comp$log2FC_93_3p and combined_comp$log2FC_484
## t = 12.693, df = 152, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6307238 0.7862817
## sample estimates:
## cor
## 0.717327
p <- ggplot(data = combined, mapping = aes(x=log2FC_93_5p, y=log2FC_484))+
geom_point(shape = 21, size = 3, colour = "black", fill = "#08519C")+
geom_point(data = subset(combined, combined$dataset == "cohort2"),
aes(x=log2FC_93_5p, y=log2FC_484),
shape = 21, size = 3, colour = "black", fill = "#E31A1C") +
geom_smooth(method = "lm")+
ggtitle("miR-484 miR-93-5p correlation")+
theme_pubr(); p
cor.test(combined_comp$log2FC_93_5p, combined_comp$log2FC_484, method=c("pearson"))
##
## Pearson's product-moment correlation
##
## data: combined_comp$log2FC_93_5p and combined_comp$log2FC_484
## t = 14.749, df = 152, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6932165 0.8252700
## sample estimates:
## cor
## 0.7672556