Uterine Leiomyoma (GSE128242, 15 Women): ECM, COL4A5, COL4A6, and MED12

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

To get straight to the point, here is the basis of this analysis. I want to add the MED12, COL4A5, COL4A6, HMGA2, and extracellular matrix (ECM) genes from genecards.org to the body systems gene set we developed in previous analyses of COVID-19, Ulcerative Colitis, Crohn’s Disease, and Rheumatoid Arthritis. This analysis covers uterine leiomyomas (ULs), benign tumors that affect many women, are responsible for 30% of all hysterectomies, and cause other discomforts. The study explores the ECM, collagen, and hormone genes of a specific type of UL. Its samples come from 15 women who underwent hysterectomies. Each woman contributed a healthy myometrium sample matched with a diseased UL tissue sample, and each UL carried a mutation in exon 2 of MED12, among other selection criteria. An excerpt from the associated article and the GEO series record are below.

“In this study, freshly procured tissue samples from women who have undergone hysterectomies as a course of treatment for uterine leiomyomas, confirmed to have a glycine-to-aspartate (G44D) or glycine-to-serine (G44S) substitution in exon 2 of MED12, are used to characterize epigenetic changes in the disease. Adjacent, nondiseased areas of the myometrium from the same patients are also collected to represent normal (wild-type [WT]) samples. In an effort to avoid artifactual alterations to the transcriptomic and epigenomic profiles of patient samples, gene expression profiling by RNA-sequencing (RNA-seq) as well as epigenetic profiling by high-resolution chromatin immunoprecipitation-sequencing (ChIP-seq) and promoter capture Hi-C are performed directly from tissue samples with minimal processing. Our integrative analysis of transcriptomic and epigenetic changes, highlighted by the near-native characterization of long-range promoter interactions in uterine fibroids, identifies differential transcription factor occupancy, differential enhancer engagement, and altered enhancer-promoter contacts as key events that drive gene dysregulation in leiomyomas.

Results: Transcriptome profiling of fibroids. We used RNA isolation followed by massively parallel sequencing (RNA-seq) to examine the transcriptome profiles of normal myometrium (WT) and matched leiomyoma (G44D/S) tissue obtained from 15 women. A high degree of similarity between biological replicates of myometrium transcriptome profiles was seen, with a similar observation among biological replicates of leiomyoma tissue samples. Hierarchical clustering of all RNA-seq datasets highlights clustering primarily by disease state (Fig. 1a). Significantly, principal component analysis of the most variable genes revealed that 43% of the variance (PC1) is explained by the disease state, with biological replicates co-segregating based on tissue type (Fig. 1b). This suggests that the changes in gene expression between normal and MED12 mutant disease tissue types are primarily attributable to biological pathways that are important for the” (excerpt from the article associated with NCBI GEO series GSE128242; the series record follows):

Series: GSE128242
Status: Public on Jan 01, 2020
Title: Altered chromatin landscape and enhancer engagement underlie transcriptional dysregulation in MED12 mutant uterine leiomyomas
Organism: Homo sapiens
Experiment type: Expression profiling by high throughput sequencing; Genome binding/occupancy profiling by high throughput sequencing; Other
Summary: This SuperSeries is composed of the SubSeries listed below.
Overall design: Refer to individual Series
Citation(s): Moyo MB, Parker JB, Chakravarti D. Altered chromatin landscape and enhancer engagement underlie transcriptional dysregulation in MED12 mutant uterine leiomyomas. Nat Commun 2020 Feb 24;11(1):1019. PMID: 32094355
Submission date: Mar 13, 2019
Last update date: Mar 09, 2020
Contact name: Debabrata Chakravarti
E-mail(s): debu@northwestern.edu

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Let's use our functions to gather the genecards.org genes related to the search terms used in this analysis: ECM, collagen, hormone, transcription, COL4A5, COL4A6, MED12, and HMGA2.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)

source('geneCards2.R')

bodySystemGenes <- read.csv('bodySystemGenes.csv')
head(bodySystemGenes)

Let's keep just the first 5 features. The other features come from an Ulcerative Colitis and Crohn’s Disease analysis of GSE135223.

BS1 <- bodySystemGenes[,1:5]

find25genes('ECM')
find25genes('hormone')
find25genes('collagen')
find25genes('transcription')

getProteinGenes('ECM')
getProteinGenes('hormone')
getProteinGenes('collagen')
getProteinGenes('transcription')

# Read each Top25 file into the object that matches its search term
ecm <- read.csv("Top25ecms.csv")
collagen <- read.csv("Top25collagens.csv")
hormone <- read.csv("Top25hormones.csv")
transcription <- read.csv("Top25transcriptions.csv")

for (i in ecm$proteinType){
  getSummaries2(i,'ECM')
}

for (i in collagen$proteinType){
  getSummaries2(i,'collagen')
}

for (i in hormone$proteinType){
  getSummaries2(i,'hormone')
}

for (i in transcription$proteinType){
  getSummaries2(i,'transcription')
}

find25genes('HMGA2')
find25genes('MED12')
find25genes('COL4A6')
find25genes('COL4A5')

hmga2 <- read.csv("Top25hmga2s.csv")
med12 <- read.csv('Top25med12s.csv')
col4a5 <- read.csv('Top25col4a5s.csv')
col4a6 <- read.csv('Top25col4a6s.csv')

for (i in hmga2$proteinType){
  getSummaries2(i,'hmga2')
}

for (i in med12$proteinType){
  getSummaries2(i,'med12')
}

for (i in col4a5$proteinType){
  getSummaries2(i, 'col4a5')
}

for (i in col4a6$proteinType){
  getSummaries2(i,'col4a6')
}
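The eight read-and-loop steps above all follow the same file naming pattern, so the label and its file can be paired in one place. This is only a minimal sketch: `top25_file` is a hypothetical helper I made up for illustration, and it assumes every file follows the "Top25<lowercase term>s.csv" pattern seen in the file names above.

```r
# Search terms used in this analysis (taken from the calls above)
search_terms <- c("ECM", "collagen", "hormone", "transcription",
                  "HMGA2", "MED12", "COL4A5", "COL4A6")

# Hypothetical helper: builds the file name from the search term,
# assuming the "Top25<lowercase term>s.csv" naming pattern
top25_file <- function(term) {
  paste0("Top25", tolower(term), "s.csv")
}

# Named character vector: names are the search terms, values the file names
files <- vapply(search_terms, top25_file, character(1))
print(files)
```

Because each file name is derived from its own label, a label can never be paired with another term's file by accident.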

getGeneSummaries('ECM')
getGeneSummaries('collagen')
getGeneSummaries('hormone')
getGeneSummaries('transcription')
getGeneSummaries('HMGA2')
getGeneSummaries('MED12')
getGeneSummaries('COL4A5')
getGeneSummaries('COL4A6')

transcriptionSumms <- read.csv("proteinGeneSummaries_transcription.csv")
hormoneSumms <- read.csv("proteinGeneSummaries_hormone.csv")
collagenSumms <- read.csv("proteinGeneSummaries_collagen.csv")
ecmSumms <- read.csv("proteinGeneSummaries_ecm.csv")
med12Summs <- read.csv("proteinGeneSummaries_med12.csv")
col4a5Summs <- read.csv("proteinGeneSummaries_col4a5.csv")
col4a6Summs <- read.csv("proteinGeneSummaries_col4a6.csv")
hmga2Summs <- read.csv("proteinGeneSummaries_hmga2.csv")

newSumms <- rbind(transcriptionSumms,hormoneSumms,
                  collagenSumms,ecmSumms,
                  med12Summs,col4a5Summs,col4a6Summs,
                  hmga2Summs)
colnames(newSumms)
NS <- newSumms[,c(2,1,4:6)]

bodySystems2 <- rbind(BS1,NS)

write.csv(bodySystems2,'bodySystems2.csv',row.names=F)
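The reordering with `c(2,1,4:6)` works by position, so it depends on the summary files keeping their column order. A name-based version is a little safer. The sketch below uses toy stand-in tables; the column `extraColumn` and all of the values are made up for illustration, and the five kept column names are the ones that show up later in the merged table.

```r
# Toy stand-in for BS1 (the first 5 features of the body systems table)
BS1_toy <- data.frame(
  gene = c("AANAT", "ADH1B"),
  proteinSearched = c("melatonin", "alcohol"),
  EntrezSummary = c("a", "b"),
  GeneCardsSummary = c("c", "d"),
  UniProtKB_Summary = c("e", "f"),
  stringsAsFactors = FALSE
)

# Toy stand-in for newSumms; extraColumn is a hypothetical unused column
newSumms_toy <- data.frame(
  proteinSearched = c("med12", "col4a5"),
  gene = c("MED12", "COL4A5"),
  extraColumn = c(1, 2),
  EntrezSummary = c("g", "h"),
  GeneCardsSummary = c("i", "j"),
  UniProtKB_Summary = c("k", "l"),
  stringsAsFactors = FALSE
)

# Select by name, in the order BS1 uses, and fail early if a column is missing
wanted <- colnames(BS1_toy)
stopifnot(all(wanted %in% colnames(newSumms_toy)))
NS_toy <- newSumms_toy[, wanted]

# rbind() on data frames matches columns by name, so the stack is now safe
combined_toy <- rbind(BS1_toy, NS_toy)
print(combined_toy[, c("gene", "proteinSearched")])
```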

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

We now have the genes we are going to work with. The earlier body systems each keep only their top 3 ranked genes, but the 8 new search terms related to UL and this study (ECM, collagen, hormone, transcription, MED12, COL4A5, COL4A6, and HMGA2) each contribute their top 25 ranked genes. There are a few duplicates, COL4A5 and one other, among the genes that were searched by name. They were searched separately because they weren't among the top 3 ranked genes of the other body systems, and some, like COL4A6 and HMGA2, weren't even among the top 25 ranked genes for those systems. Let's clear out our environment in RStudio and read in our new body system genes.

bodySystems2 <- read.csv('bodySystems2.csv',stringsAsFactors = T)
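To see which genes are listed under more than one search term, we can count the distinct `proteinSearched` values per gene. This is a minimal sketch on a toy table; the gene and search term pairings below are invented for illustration and are not results from the real bodySystems2.csv.

```r
# Toy table with the two columns we need (pairings are made up)
bs_toy <- data.frame(
  gene = c("COL4A5", "COL4A5", "MED12", "HMGA2", "COL4A6", "COL4A6"),
  proteinSearched = c("collagen", "col4a5", "med12", "hmga2",
                      "col4a6", "col4a5"),
  stringsAsFactors = FALSE
)

# Number of distinct search terms each gene appears under
terms_per_gene <- tapply(bs_toy$proteinSearched, bs_toy$gene,
                         function(x) length(unique(x)))

# Genes that appear under more than one search term
multi <- names(terms_per_gene)[terms_per_gene > 1]
print(multi)
```

On the real table the same two lines can be run with `bodySystems2` in place of `bs_toy`.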

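Let's now read in our UL data from GSE128242. The DESeq2 results file name carries the accession GSE128229, presumably one of the SubSeries of this SuperSeries.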

UL <- read.delim('GSE128229_RNA_Tissue_DESeq2.txt',sep='\t',header=T)

DF <- merge(bodySystems2,UL,by.x='gene',by.y='symbol')

head(DF)

##    gene proteinSearched
## 1 AANAT       melatonin
## 2 AANAT       melatonin
## 3 AANAT       melatonin
## 4 AANAT       melatonin
## 5 AANAT       melatonin
## 6 AANAT       melatonin
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 EntrezSummary
## 1 The protein encoded by this gene belongs to the acetyltransferase superfamily. It is the penultimate enzyme in melatonin synthesis and controls the night/day rhythm in melatonin production in the vertebrate pineal gland. Melatonin is essential for the function of the circadian clock that influences activity and sleep. This enzyme is regulated by cAMP-dependent phosphorylation that promotes its interaction with 14-3-3 proteins and thus protects the enzyme against proteasomal degradation. This gene may contribute to numerous genetic diseases such as delayed sleep phase syndrome. Alternatively spliced transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Oct 2009]
## 2 The protein encoded by this gene belongs to the acetyltransferase superfamily. It is the penultimate enzyme in melatonin synthesis and controls the night/day rhythm in melatonin production in the vertebrate pineal gland. Melatonin is essential for the function of the circadian clock that influences activity and sleep. This enzyme is regulated by cAMP-dependent phosphorylation that promotes its interaction with 14-3-3 proteins and thus protects the enzyme against proteasomal degradation. This gene may contribute to numerous genetic diseases such as delayed sleep phase syndrome. Alternatively spliced transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Oct 2009]
## 3 The protein encoded by this gene belongs to the acetyltransferase superfamily. It is the penultimate enzyme in melatonin synthesis and controls the night/day rhythm in melatonin production in the vertebrate pineal gland. Melatonin is essential for the function of the circadian clock that influences activity and sleep. This enzyme is regulated by cAMP-dependent phosphorylation that promotes its interaction with 14-3-3 proteins and thus protects the enzyme against proteasomal degradation. This gene may contribute to numerous genetic diseases such as delayed sleep phase syndrome. Alternatively spliced transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Oct 2009]
## 4 The protein encoded by this gene belongs to the acetyltransferase superfamily. It is the penultimate enzyme in melatonin synthesis and controls the night/day rhythm in melatonin production in the vertebrate pineal gland. Melatonin is essential for the function of the circadian clock that influences activity and sleep. This enzyme is regulated by cAMP-dependent phosphorylation that promotes its interaction with 14-3-3 proteins and thus protects the enzyme against proteasomal degradation. This gene may contribute to numerous genetic diseases such as delayed sleep phase syndrome. Alternatively spliced transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Oct 2009]
## 5 The protein encoded by this gene belongs to the acetyltransferase superfamily. It is the penultimate enzyme in melatonin synthesis and controls the night/day rhythm in melatonin production in the vertebrate pineal gland. Melatonin is essential for the function of the circadian clock that influences activity and sleep. This enzyme is regulated by cAMP-dependent phosphorylation that promotes its interaction with 14-3-3 proteins and thus protects the enzyme against proteasomal degradation. This gene may contribute to numerous genetic diseases such as delayed sleep phase syndrome. Alternatively spliced transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Oct 2009]
## 6 The protein encoded by this gene belongs to the acetyltransferase superfamily. It is the penultimate enzyme in melatonin synthesis and controls the night/day rhythm in melatonin production in the vertebrate pineal gland. Melatonin is essential for the function of the circadian clock that influences activity and sleep. This enzyme is regulated by cAMP-dependent phosphorylation that promotes its interaction with 14-3-3 proteins and thus protects the enzyme against proteasomal degradation. This gene may contribute to numerous genetic diseases such as delayed sleep phase syndrome. Alternatively spliced transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Oct 2009]
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         GeneCardsSummary
## 1                                             AANAT (Aralkylamine N-Acetyltransferase) is a Protein Coding gene.                                            Diseases associated with AANAT include Dissociative Amnesia and Baastrup's Syndrome.                                            Among its related pathways are superpathway of tryptophan utilization and Tryptophan metabolism.                                            Gene Ontology (GO) annotations related to this gene include N-acetyltransferase activity and arylamine N-acetyltransferase activity.                                            
## 2                                             AANAT (Aralkylamine N-Acetyltransferase) is a Protein Coding gene.                                            Diseases associated with AANAT include Dissociative Amnesia and Baastrup's Syndrome.                                            Among its related pathways are superpathway of tryptophan utilization and Tryptophan metabolism.                                            Gene Ontology (GO) annotations related to this gene include N-acetyltransferase activity and arylamine N-acetyltransferase activity.                                            
## 3                                             AANAT (Aralkylamine N-Acetyltransferase) is a Protein Coding gene.                                            Diseases associated with AANAT include Dissociative Amnesia and Baastrup's Syndrome.                                            Among its related pathways are superpathway of tryptophan utilization and Tryptophan metabolism.                                            Gene Ontology (GO) annotations related to this gene include N-acetyltransferase activity and arylamine N-acetyltransferase activity.                                            
## 4                                             AANAT (Aralkylamine N-Acetyltransferase) is a Protein Coding gene.                                            Diseases associated with AANAT include Dissociative Amnesia and Baastrup's Syndrome.                                            Among its related pathways are superpathway of tryptophan utilization and Tryptophan metabolism.                                            Gene Ontology (GO) annotations related to this gene include N-acetyltransferase activity and arylamine N-acetyltransferase activity.                                            
## 5                                             AANAT (Aralkylamine N-Acetyltransferase) is a Protein Coding gene.                                            Diseases associated with AANAT include Dissociative Amnesia and Baastrup's Syndrome.                                            Among its related pathways are superpathway of tryptophan utilization and Tryptophan metabolism.                                            Gene Ontology (GO) annotations related to this gene include N-acetyltransferase activity and arylamine N-acetyltransferase activity.                                            
## 6                                             AANAT (Aralkylamine N-Acetyltransferase) is a Protein Coding gene.                                            Diseases associated with AANAT include Dissociative Amnesia and Baastrup's Syndrome.                                            Among its related pathways are superpathway of tryptophan utilization and Tryptophan metabolism.                                            Gene Ontology (GO) annotations related to this gene include N-acetyltransferase activity and arylamine N-acetyltransferase activity.                                            
##                                                                                                                                                                                                                                                        UniProtKB_Summary
## 1 Controls the night/day rhythm of melatonin production in the pineal gland. Catalyzes the N-acetylation of serotonin into N-acetylserotonin, the penultimate step in the synthesis of melatonin.\n                         SNAT_HUMAN,Q16613\n                         
## 2 Controls the night/day rhythm of melatonin production in the pineal gland. Catalyzes the N-acetylation of serotonin into N-acetylserotonin, the penultimate step in the synthesis of melatonin.\n                         SNAT_HUMAN,Q16613\n                         
## 3 Controls the night/day rhythm of melatonin production in the pineal gland. Catalyzes the N-acetylation of serotonin into N-acetylserotonin, the penultimate step in the synthesis of melatonin.\n                         SNAT_HUMAN,Q16613\n                         
## 4 Controls the night/day rhythm of melatonin production in the pineal gland. Catalyzes the N-acetylation of serotonin into N-acetylserotonin, the penultimate step in the synthesis of melatonin.\n                         SNAT_HUMAN,Q16613\n                         
## 5 Controls the night/day rhythm of melatonin production in the pineal gland. Catalyzes the N-acetylation of serotonin into N-acetylserotonin, the penultimate step in the synthesis of melatonin.\n                         SNAT_HUMAN,Q16613\n                         
## 6 Controls the night/day rhythm of melatonin production in the pineal gland. Catalyzes the N-acetylation of serotonin into N-acetylserotonin, the penultimate step in the synthesis of melatonin.\n                         SNAT_HUMAN,Q16613\n                         
##            GeneID  baseMean log2FoldChange     lfcSE    pvalue      padj
## 1 ENSG00000129673 0.7401816      0.2740803 0.4811392 0.2439541 0.4266912
## 2 ENSG00000129673 0.7401816      0.2740803 0.4811392 0.2439541 0.4266912
## 3 ENSG00000129673 0.7401816      0.2740803 0.4811392 0.2439541 0.4266912
## 4 ENSG00000129673 0.7401816      0.2740803 0.4811392 0.2439541 0.4266912
## 5 ENSG00000129673 0.7401816      0.2740803 0.4811392 0.2439541 0.4266912
## 6 ENSG00000129673 0.7401816      0.2740803 0.4811392 0.2439541 0.4266912
##          biotype entrez MYO_PT728 MYO_PT758 MYO_PT1063 MYO_PT1113 MYO_PT1119
## 1 protein_coding     15 0.8004071  1.016947          0   0.815977   1.722282
## 2 protein_coding     15 0.8004071  1.016947          0   0.815977   1.722282
## 3 protein_coding     15 0.8004071  1.016947          0   0.815977   1.722282
## 4 protein_coding     15 0.8004071  1.016947          0   0.815977   1.722282
## 5 protein_coding     15 0.8004071  1.016947          0   0.815977   1.722282
## 6 protein_coding     15 0.8004071  1.016947          0   0.815977   1.722282
##   MYO_PT1123 MYO_PT1151 MYO_PT354 MYO_PT563 MYO_PT845 MYO_PT848 MYO_PT886
## 1          0          0         0         0         0 0.9172141         0
## 2          0          0         0         0         0 0.9172141         0
## 3          0          0         0         0         0 0.9172141         0
## 4          0          0         0         0         0 0.9172141         0
## 5          0          0         0         0         0 0.9172141         0
## 6          0          0         0         0         0 0.9172141         0
##   MYO_PT916 MYO_PT967 MYO_PTc57 LEIO_PT728 LEIO_PT758 LEIO_PT1063 LEIO_PT1113
## 1  1.100972 0.8649241         0          0          0    3.493782           0
## 2  1.100972 0.8649241         0          0          0    3.493782           0
## 3  1.100972 0.8649241         0          0          0    3.493782           0
## 4  1.100972 0.8649241         0          0          0    3.493782           0
## 5  1.100972 0.8649241         0          0          0    3.493782           0
## 6  1.100972 0.8649241         0          0          0    3.493782           0
##   LEIO_PT1119 LEIO_PT1123 LEIO_PT1151 LEIO_PT354 LEIO_PT563 LEIO_PT845
## 1   0.9604267    1.789387           0   0.895243   1.135928   1.261787
## 2   0.9604267    1.789387           0   0.895243   1.135928   1.261787
## 3   0.9604267    1.789387           0   0.895243   1.135928   1.261787
## 4   0.9604267    1.789387           0   0.895243   1.135928   1.261787
## 5   0.9604267    1.789387           0   0.895243   1.135928   1.261787
## 6   0.9604267    1.789387           0   0.895243   1.135928   1.261787
##   LEIO_PT848 LEIO_PT886 LEIO_PT916 LEIO_PT967 LEIO_PTc57
## 1    1.00634   1.142734    2.32974  0.9513567          0
## 2    1.00634   1.142734    2.32974  0.9513567          0
## 3    1.00634   1.142734    2.32974  0.9513567          0
## 4    1.00634   1.142734    2.32974  0.9513567          0
## 5    1.00634   1.142734    2.32974  0.9513567          0
## 6    1.00634   1.142734    2.32974  0.9513567          0
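The six identical AANAT rows above come from how `merge()` works: it returns one row for every matching pair of keys, so repeated rows in either table are multiplied in the result. One plausible cause here (an assumption, since I have not checked the raw tables) is repeated rows in the body systems table. A minimal sketch with toy data and made-up fold changes:

```r
# Toy gene table with a repeated row (values are invented)
genes_toy <- data.frame(
  gene = c("AANAT", "AANAT", "AANAT", "MED12"),
  proteinSearched = c("melatonin", "melatonin", "melatonin", "med12"),
  stringsAsFactors = FALSE
)

# Toy expression table, one row per symbol (fold changes are invented)
expr_toy <- data.frame(
  symbol = c("AANAT", "MED12", "TP53"),
  log2FoldChange = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# Merging as-is repeats the AANAT expression row three times
merged_raw <- merge(genes_toy, expr_toy, by.x = "gene", by.y = "symbol")

# Dropping exact duplicate rows first gives one row per gene/term pair
merged_dedup <- merge(unique(genes_toy), expr_toy,
                      by.x = "gene", by.y = "symbol")

print(nrow(merged_raw))
print(nrow(merged_dedup))
```

Deduplicating after the merge, as done further down with `!duplicated(DF)`, reaches the same rows; deduplicating before just keeps the merged table smaller.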

colnames(DF) <- gsub('GeneID','EnsemblID',colnames(DF))
colnames(DF) <- gsub('entrez','EntrezID',colnames(DF))
colnames(DF)

##  [1] "gene"              "proteinSearched"   "EntrezSummary"    
##  [4] "GeneCardsSummary"  "UniProtKB_Summary" "EnsemblID"        
##  [7] "baseMean"          "log2FoldChange"    "lfcSE"            
## [10] "pvalue"            "padj"              "biotype"          
## [13] "EntrezID"          "MYO_PT728"         "MYO_PT758"        
## [16] "MYO_PT1063"        "MYO_PT1113"        "MYO_PT1119"       
## [19] "MYO_PT1123"        "MYO_PT1151"        "MYO_PT354"        
## [22] "MYO_PT563"         "MYO_PT845"         "MYO_PT848"        
## [25] "MYO_PT886"         "MYO_PT916"         "MYO_PT967"        
## [28] "MYO_PTc57"         "LEIO_PT728"        "LEIO_PT758"       
## [31] "LEIO_PT1063"       "LEIO_PT1113"       "LEIO_PT1119"      
## [34] "LEIO_PT1123"       "LEIO_PT1151"       "LEIO_PT354"       
## [37] "LEIO_PT563"        "LEIO_PT845"        "LEIO_PT848"       
## [40] "LEIO_PT886"        "LEIO_PT916"        "LEIO_PT967"       
## [43] "LEIO_PTc57"
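The `gsub()` calls work here because no other column name contains 'GeneID' or the lowercase 'entrez', but `gsub()` replaces a pattern anywhere inside a name. Renaming by exact match avoids surprises if the columns ever change. A minimal sketch on a short toy vector of column names (`rename_map` is just a name I chose for the lookup):

```r
# Toy column names, including one that shares a stem with a renamed column
cols <- c("gene", "GeneID", "entrez", "EntrezSummary", "MYO_PT728")

# Old name -> new name lookup
rename_map <- c(GeneID = "EnsemblID", entrez = "EntrezID")

# Rename only the columns whose full name is in the lookup
hit <- cols %in% names(rename_map)
cols[hit] <- unname(rename_map[cols[hit]])

print(cols)
```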

bodySystems3 <- DF[,c(1:6,13)]
bs3 <- bodySystems3[!duplicated(bodySystems3),]

write.csv(bs3,'bodySystems3.csv',row.names=F)

DF2 <- DF[!duplicated(DF),c(1:6,14:43)]
row.names(DF2) <- NULL
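The sample columns are picked above by position (14:43). Since the sample names all start with MYO_PT or LEIO_PT, they can also be picked by name, and we can check that every myometrium sample has its matched leiomyoma sample. This is a sketch on a toy table with made-up patient IDs and values, not the real DF:

```r
# Toy table with two matched patients (IDs and values are invented)
df_toy <- data.frame(
  gene = c("A", "B"),
  padj = c(0.01, 0.5),
  MYO_PT1 = c(1, 2),
  MYO_PT2 = c(3, 4),
  LEIO_PT1 = c(5, 6),
  LEIO_PT2 = c(7, 8),
  stringsAsFactors = FALSE
)

# Find sample columns by their name prefix
myo_cols <- grep("^MYO_PT", colnames(df_toy), value = TRUE)
leio_cols <- grep("^LEIO_PT", colnames(df_toy), value = TRUE)

# Matched design check: same patient IDs on both sides
myo_ids <- sub("^MYO_", "", myo_cols)
leio_ids <- sub("^LEIO_", "", leio_cols)
stopifnot(setequal(myo_ids, leio_ids))

# Keep the gene column plus all sample columns
DF2_toy <- df_toy[, c("gene", myo_cols, leio_cols)]
print(colnames(DF2_toy))
```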

head(DF2)

##    gene      proteinSearched
## 1 AANAT            melatonin
## 2 ACSL4               col4a5
## 3 ACTA1                fiber
## 4 ADH1B              alcohol
## 5 ADH1C              alcohol
## 6   ALB green-coffee-extract
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 EntrezSummary
## 1                                                                 The protein encoded by this gene belongs to the acetyltransferase superfamily. It is the penultimate enzyme in melatonin synthesis and controls the night/day rhythm in melatonin production in the vertebrate pineal gland. Melatonin is essential for the function of the circadian clock that influences activity and sleep. This enzyme is regulated by cAMP-dependent phosphorylation that promotes its interaction with 14-3-3 proteins and thus protects the enzyme against proteasomal degradation. This gene may contribute to numerous genetic diseases such as delayed sleep phase syndrome. Alternatively spliced transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Oct 2009]
## 2                                                                                                                                                          The protein encoded by this gene is an isozyme of the long-chain fatty-acid-coenzyme A ligase family. Although differing in substrate specificity, subcellular localization, and tissue distribution, all isozymes of this family convert free long-chain fatty acids into fatty acyl-CoA esters, and thereby play a key role in lipid biosynthesis and fatty acid degradation. This isozyme preferentially utilizes arachidonate as substrate. The absence of this enzyme may contribute to the cognitive disability or Alport syndrome. Alternative splicing of this gene generates multiple transcript variants. [provided by RefSeq, Jan 2016]
## 3 The product encoded by this gene belongs to the actin family of proteins, which are highly conserved proteins that play a role in cell motility, structure and integrity. Alpha, beta and gamma actin isoforms have been identified, with alpha actins being a major constituent of the contractile apparatus, while beta and gamma actins are involved in the regulation of cell motility. This actin is an alpha actin that is found in skeletal muscle. Mutations in this gene cause a variety of myopathies, including nemaline myopathy, congenital myopathy with excess of thin myofilaments, congenital myopathy with cores, and congenital myopathy with fiber-type disproportion, diseases that lead to muscle fiber defects with manifestations such as hypotonia. [provided by RefSeq, Sep 2019]
## 4                                                                                                     The protein encoded by this gene is a member of the alcohol dehydrogenase family. Members of this enzyme family metabolize a wide variety of substrates, including ethanol, retinol, other aliphatic alcohols, hydroxysteroids, and lipid peroxidation products. This encoded protein, consisting of several homo- and heterodimers of alpha, beta, and gamma subunits, exhibits high activity for ethanol oxidation and plays a major role in ethanol catabolism. Three genes encoding alpha, beta and gamma subunits are tandemly organized in a genomic segment as a gene cluster. Two transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Nov 2013]
## 5                           This gene encodes class I alcohol dehydrogenase, gamma subunit, which is a member of the alcohol dehydrogenase family. Members of this enzyme family metabolize a wide variety of substrates, including ethanol, retinol, other aliphatic alcohols, hydroxysteroids, and lipid peroxidation products. Class I alcohol dehydrogenase, consisting of several homo- and heterodimers of alpha, beta, and gamma subunits, exhibits high activity for ethanol oxidation to acetaldehyde, thus playing a major role in ethanol catabolism. Three genes encoding alpha, beta and gamma subunits are tandemly organized in a genomic segment as a gene cluster. An association between ADH1C polymorphism and alcohol dependence has not been established. [provided by RefSeq, Sep 2019]
## 6                                                                                                                                                                           This gene encodes the most abundant protein in human blood. This protein functions in the regulation of blood plasma colloid osmotic pressure and acts as a carrier protein for a wide range of endogenous molecules including hormones, fatty acids, and metabolites, as well as exogenous drugs. Additionally, this protein exhibits an esterase-like activity with broad substrate specificity. The encoded preproprotein is proteolytically processed to generate the mature protein. A peptide derived from this protein, EPI-X4, is an endogenous inhibitor of the CXCR4 chemokine receptor. [provided by RefSeq, Jul 2016]
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      GeneCardsSummary
## 1                                                                                                                                                                                                                          AANAT (Aralkylamine N-Acetyltransferase) is a Protein Coding gene.                                            Diseases associated with AANAT include Dissociative Amnesia and Baastrup's Syndrome.                                            Among its related pathways are superpathway of tryptophan utilization and Tryptophan metabolism.                                            Gene Ontology (GO) annotations related to this gene include N-acetyltransferase activity and arylamine N-acetyltransferase activity.                                            
## 2                                             ACSL4 (Acyl-CoA Synthetase Long Chain Family Member 4) is a Protein Coding gene.                                            Diseases associated with ACSL4 include Non-Syndromic X-Linked Intellectual Disability and Stroke, Ischemic.                                            Among its related pathways are Respiratory electron transport, ATP synthesis by chemiosmotic coupling, and heat production by uncoupling proteins. and Fatty acid biosynthesis (KEGG).                                            Gene Ontology (GO) annotations related to this gene include long-chain fatty acid-CoA ligase activity and arachidonate-CoA ligase activity.                                            An important paralog of this gene is ACSL3.
## 3                                                                                                                                  ACTA1 (Actin Alpha 1, Skeletal Muscle) is a Protein Coding gene.                                            Diseases associated with ACTA1 include Myopathy, Scapulohumeroperoneal and Nemaline Myopathy 3.                                            Among its related pathways are Association Between Physico-Chemical Features and Toxicity Associated Pathways and Development Slit-Robo signaling.                                            Gene Ontology (GO) annotations related to this gene include structural constituent of cytoskeleton and myosin binding.                                            An important paralog of this gene is ACTC1.
## 4                                                                                                                                                        ADH1B (Alcohol Dehydrogenase 1B (Class I), Beta Polypeptide) is a Protein Coding gene.                                            Diseases associated with ADH1B include Alcohol Dependence and Fetal Alcohol Syndrome.                                            Among its related pathways are Glucose metabolism and acetone degradation I (to methylglyoxal).                                            Gene Ontology (GO) annotations related to this gene include oxidoreductase activity and alcohol dehydrogenase activity, zinc-dependent.                                            An important paralog of this gene is ADH1C.
## 5                                                                                                                                                                                 ADH1C (Alcohol Dehydrogenase 1C (Class I), Gamma Polypeptide) is a Protein Coding gene.                                            Diseases associated with ADH1C include Alcohol Dependence and Parkinson Disease, Late-Onset.                                            Among its related pathways are Glucose metabolism and Signaling by GPCR.                                            Gene Ontology (GO) annotations related to this gene include oxidoreductase activity and alcohol dehydrogenase (NAD) activity.                                            An important paralog of this gene is ADH1B.
## 6                                                                                                                                                                                                                                                     ALB (Albumin) is a Protein Coding gene.                                            Diseases associated with ALB include Analbuminemia and Hyperthyroxinemia, Familial Dysalbuminemic.                                            Among its related pathways are Lipoprotein metabolism and Folate Metabolism.                                            Gene Ontology (GO) annotations related to this gene include enzyme binding and chaperone binding.                                            An important paralog of this gene is AFP.
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      UniProtKB_Summary
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               Controls the night/day rhythm of melatonin production in the pineal gland. Catalyzes the N-acetylation of serotonin into N-acetylserotonin, the penultimate step in the synthesis of melatonin.\n                         SNAT_HUMAN,Q16613\n                         
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       Catalyzes the conversion of long-chain fatty acids to their active form acyl-CoA for both synthesis of cellular lipids, and degradation via beta-oxidation (PubMed:24269233, PubMed:22633490, PubMed:21242590). Preferentially activates arachidonate and eicosapentaenoate as substrates (PubMed:21242590). Preferentially activates 8,9-EET > 14,15-EET > 5,6-EET > 11,12-EET. Modulates glucose-stimulated insulin secretion by regulating the levels of unesterified EETs (By similarity). Modulates prostaglandin E2 secretion (PubMed:21242590).\n                         ACSL4_HUMAN,O60488\n                         
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Actins are highly conserved proteins that are involved in various types of cell motility and are ubiquitously expressed in all eukaryotic cells.\n                         ACTS_HUMAN,P68133\n                         
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Catalyzes the NAD-dependent oxidation of all-trans-retinol and its derivatives such as all-trans-4-hydroxyretinol and may participate in retinoid metabolism (PubMed:15369820, PubMed:16787387). In vitro can also catalyzes the NADH-dependent reduction of all-trans-retinal and its derivatives such as all-trans-4-oxoretinal (PubMed:15369820, PubMed:16787387). Catalyzes in the oxidative direction with higher efficiency (PubMed:16787387). Has the same affinity for all-trans-4-hydroxyretinol and all-trans-4-oxoretinal (PubMed:15369820).\n                         ADH1B_HUMAN,P00325\n                         
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           no summary
## 6 Binds water, Ca(2+), Na(+), K(+), fatty acids, hormones, bilirubin and drugs (Probable). Its main function is the regulation of the colloidal osmotic pressure of blood (Probable). Major zinc transporter in plasma, typically binds about 80% of all plasma zinc (PubMed:19021548). Major calcium and magnesium transporter in plasma, binds approximately 45% of circulating calcium and magnesium in plasma (By similarity). Potentially has more than two calcium-binding sites and might additionally bind calcium in a non-specific manner (By similarity). The shared binding site between zinc and calcium at residue Asp-273 suggests a crosstalk between zinc and calcium transport in the blood (By similarity). The rank order of affinity is zinc > calcium > magnesium (By similarity). Binds to the bacterial siderophore enterobactin and inhibits enterobactin-mediated iron uptake of E.coli from ferric transferrin, and may thereby limit the utilization of iron and growth of enteric bacteria such as E.coli (PubMed:6234017). Does not prevent iron uptake by the bacterial siderophore aerobactin (PubMed:6234017).\n                         ALBU_HUMAN,P02768\n                         
##         EnsemblID    MYO_PT728  MYO_PT758  MYO_PT1063  MYO_PT1113  MYO_PT1119
## 1 ENSG00000129673    0.8004071   1.016947    0.000000    0.815977    1.722282
## 2 ENSG00000068366 1307.0648590 759.659291  745.811051  682.972762  994.617898
## 3 ENSG00000143632    1.6008143   5.084734    3.166926    5.711839    0.861141
## 4 ENSG00000196616 1277.4497950 754.574557 1858.193775 1776.381964 2382.777250
## 5 ENSG00000248144    8.8044785   4.067787   11.875972   13.871609   15.500539
## 6 ENSG00000163631    4.0020357   1.016947    2.375194    0.815977    0.861141
##    MYO_PT1123  MYO_PT1151   MYO_PT354   MYO_PT563   MYO_PT845   MYO_PT848
## 1    0.000000   0.0000000    0.000000    0.000000    0.000000   0.9172141
## 2  971.548511 840.9929351  910.819020  803.749860 1309.894163 725.5163300
## 3    0.000000   4.4733667   13.516721    6.271130   10.970638   0.9172141
## 4 1522.354540 984.1406688 2755.331511 1055.640258  492.581641 668.6490576
## 5    6.110368   1.7893467   13.516721    6.271130    3.291191   9.1721407
## 6    2.618729   0.8946733    1.039748    2.090377    0.000000   5.5032844
##    MYO_PT886   MYO_PT916    MYO_PT967   MYO_PTc57 LEIO_PT728 LEIO_PT758
## 1   0.000000    1.100972    0.8649241    0.000000   0.000000   0.000000
## 2 366.144624  868.666757 1045.6932110  741.454644 574.346865 500.257435
## 3   5.038687    8.807774    0.0000000    7.144638   1.911304   3.050350
## 4 560.973874 1040.418359 1135.6453150 2348.998172  83.141726 110.829391
## 5  16.795625    5.504859   25.9477224   35.723189   0.000000   1.016783
## 6   1.679562    1.100972    0.0000000    1.587697   0.955652   0.000000
##   LEIO_PT1063 LEIO_PT1113 LEIO_PT1119 LEIO_PT1123 LEIO_PT1151 LEIO_PT354
## 1    3.493782     0.00000   0.9604267    1.789387   0.0000000   0.895243
## 2  330.744720   218.47587 385.1311145  518.027520 381.7523318 461.945413
## 3    0.000000     0.00000   4.8021336    1.789387   7.2714730  25.962048
## 4  158.384796   255.27180 505.1844544  369.508404 205.4191119  47.447882
## 5    6.987565    10.73215   8.6438405    8.946935   4.5446706   3.580972
## 6    1.164594     0.00000   0.9604267    0.000000   0.9089341   0.000000
##   LEIO_PT563 LEIO_PT845 LEIO_PT848 LEIO_PT886 LEIO_PT916  LEIO_PT967 LEIO_PTc57
## 1   1.135928   1.261787    1.00634   1.142734    2.32974   0.9513567   0.000000
## 2 152.214365 311.661438  402.53602 573.652585  437.99112 506.1217393 355.888915
## 3   2.271856  11.356085    1.00634   7.999140    3.49461   0.9513567   0.000000
## 4  79.514967 177.911995   26.16484   7.999140   53.58402  11.4162798  98.385673
## 5   7.951497   6.308936    1.00634   1.142734    0.00000   0.0000000   4.858552
## 6   0.000000   0.000000    2.01268   2.285468    0.00000   0.0000000   1.214638

These are matched samples: the alphanumeric patient tag at the end of each column name pairs a ‘MYO’ (healthy myometrium) sample with the ‘LEIO’ (leiomyoma) sample taken from the same uterus at hysterectomy. We will replace the patient tags PT728 through PTc57 with the indices 1-15. The columns are already in matching order, so index i refers to the same patient in both groups.

colnames(DF2)

##  [1] "gene"              "proteinSearched"   "EntrezSummary"    
##  [4] "GeneCardsSummary"  "UniProtKB_Summary" "EnsemblID"        
##  [7] "MYO_PT728"         "MYO_PT758"         "MYO_PT1063"       
## [10] "MYO_PT1113"        "MYO_PT1119"        "MYO_PT1123"       
## [13] "MYO_PT1151"        "MYO_PT354"         "MYO_PT563"        
## [16] "MYO_PT845"         "MYO_PT848"         "MYO_PT886"        
## [19] "MYO_PT916"         "MYO_PT967"         "MYO_PTc57"        
## [22] "LEIO_PT728"        "LEIO_PT758"        "LEIO_PT1063"      
## [25] "LEIO_PT1113"       "LEIO_PT1119"       "LEIO_PT1123"      
## [28] "LEIO_PT1151"       "LEIO_PT354"        "LEIO_PT563"       
## [31] "LEIO_PT845"        "LEIO_PT848"        "LEIO_PT886"       
## [34] "LEIO_PT916"        "LEIO_PT967"        "LEIO_PTc57"
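
The statement that the MYO and LEIO columns are already in matching order can be checked rather than taken on trust. Here is a minimal sketch in base R, using the column names printed above. `check_matched_order` is a hypothetical helper introduced only for illustration; it is not part of the original analysis.

```r
# Sample columns 7-36 of DF2, copied from the colnames(DF2) output
sample_cols <- c(
  "MYO_PT728", "MYO_PT758", "MYO_PT1063", "MYO_PT1113", "MYO_PT1119",
  "MYO_PT1123", "MYO_PT1151", "MYO_PT354", "MYO_PT563", "MYO_PT845",
  "MYO_PT848", "MYO_PT886", "MYO_PT916", "MYO_PT967", "MYO_PTc57",
  "LEIO_PT728", "LEIO_PT758", "LEIO_PT1063", "LEIO_PT1113", "LEIO_PT1119",
  "LEIO_PT1123", "LEIO_PT1151", "LEIO_PT354", "LEIO_PT563", "LEIO_PT845",
  "LEIO_PT848", "LEIO_PT886", "LEIO_PT916", "LEIO_PT967", "LEIO_PTc57"
)

# Hypothetical helper: TRUE when the patient tags after MYO_ and LEIO_
# appear in the same order, so position i in each block is the same patient
check_matched_order <- function(cols) {
  myo  <- sub("^MYO_",  "", grep("^MYO_",  cols, value = TRUE))
  leio <- sub("^LEIO_", "", grep("^LEIO_", cols, value = TRUE))
  length(myo) == length(leio) && all(myo == leio)
}

matched <- check_matched_order(sample_cols)
print(matched)
```

On the real data frame the same check would be `check_matched_order(colnames(DF2)[7:36])`.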

DF3 <- DF2
# Relabel the 30 sample columns (7-36): healthy_1..15 for MYO, leiomyoma_1..15 for LEIO
sampleNames <- c(paste('healthy',1:15,sep='_'),
                 paste('leiomyoma',1:15,sep='_'))
colnames(DF3)[7:36] <- sampleNames
colnames(DF3)

##  [1] "gene"              "proteinSearched"   "EntrezSummary"    
##  [4] "GeneCardsSummary"  "UniProtKB_Summary" "EnsemblID"        
##  [7] "healthy_1"         "healthy_2"         "healthy_3"        
## [10] "healthy_4"         "healthy_5"         "healthy_6"        
## [13] "healthy_7"         "healthy_8"         "healthy_9"        
## [16] "healthy_10"        "healthy_11"        "healthy_12"       
## [19] "healthy_13"        "healthy_14"        "healthy_15"       
## [22] "leiomyoma_1"       "leiomyoma_2"       "leiomyoma_3"      
## [25] "leiomyoma_4"       "leiomyoma_5"       "leiomyoma_6"      
## [28] "leiomyoma_7"       "leiomyoma_8"       "leiomyoma_9"      
## [31] "leiomyoma_10"      "leiomyoma_11"      "leiomyoma_12"     
## [34] "leiomyoma_13"      "leiomyoma_14"      "leiomyoma_15"
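
Renaming the columns drops the original patient tags, so it may help to keep a small key that maps each index back to its tag. The sketch below builds one from the tag order printed by colnames(DF2); `sample_key` is a hypothetical name and the table is not used elsewhere in this analysis.

```r
# Patient tags in the column order printed by colnames(DF2)
patient_tags <- c("PT728", "PT758", "PT1063", "PT1113", "PT1119", "PT1123",
                  "PT1151", "PT354", "PT563", "PT845", "PT848", "PT886",
                  "PT916", "PT967", "PTc57")

# Hypothetical lookup table: one row per patient, linking the short index
# to the original tag and to both renamed columns
sample_key <- data.frame(
  index     = seq_along(patient_tags),
  patient   = patient_tags,
  healthy   = paste("healthy", seq_along(patient_tags), sep = "_"),
  leiomyoma = paste("leiomyoma", seq_along(patient_tags), sep = "_"),
  stringsAsFactors = FALSE
)

# Which patient is behind leiomyoma_8?
patient_8 <- sample_key$patient[sample_key$leiomyoma == "leiomyoma_8"]
print(patient_8)
```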

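# Count how many rows each gene has in DF3
DF4 <- DF3 %>% count(gene)

colnames(DF4)[2] <- 'geneCounts_GSE128242'

# Collapse to one row per gene by averaging each sample column over that gene's rows
DF5 <- DF3 %>% group_by(gene) %>% summarise_at(vars(healthy_1:leiomyoma_15),mean)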
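
To see what the count and per-gene mean steps do when a gene appears on more than one row, here is a toy sketch. It uses base R (`table` and `aggregate`) instead of dplyr so that it runs without extra packages, and every value in `toy` is made up for illustration.

```r
# Toy stand-in for DF3: gene B appears on two rows (values are made up)
toy <- data.frame(
  gene        = c("A", "B", "B"),
  healthy_1   = c(10, 4, 6),
  leiomyoma_1 = c(20, 1, 3),
  stringsAsFactors = FALSE
)

# Rows per gene, the base R counterpart of the count() step
gene_counts <- as.data.frame(table(gene = toy$gene),
                             responseName = "geneCounts",
                             stringsAsFactors = FALSE)

# One row per gene, averaging each sample column over the duplicates
gene_means <- aggregate(cbind(healthy_1, leiomyoma_1) ~ gene,
                        data = toy, FUN = mean)

print(gene_counts)
print(gene_means)
```

If a gene's duplicate rows carry identical expression values, averaging leaves those values unchanged and only removes the duplication.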

# Per-gene means across the 15 healthy and the 15 leiomyoma samples
DF5$healthyMean <- rowMeans(DF5[,2:16])
DF5$leiomyomaMean <- rowMeans(DF5[,17:31])
# Fold change = leiomyomaMean/healthyMean, with the zero cases handled explicitly:
#   x/0 (Inf) -> 1+leiomyomaMean;  0/0 (NaN) -> 1;  0/x (0) -> 1-healthyMean
ratio <- DF5$leiomyomaMean/DF5$healthyMean
DF5$LEIO_Healthy_foldChange <- ifelse(is.infinite(ratio),1+DF5$leiomyomaMean,
  ifelse(is.nan(ratio),1,
  ifelse(ratio<=0,1-DF5$healthyMean,ratio)))
head(DF5)

## # A tibble: 6 x 34
##   gene  healthy_1 healthy_2 healthy_3 healthy_4 healthy_5 healthy_6 healthy_7
##   <fct>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
## 1 AANAT     0.800      1.02      0        0.816     1.72       0        0    
## 2 ACSL4  1307.       760.      746.     683.      995.       972.     841.   
## 3 ACTA1     1.60       5.08      3.17     5.71      0.861      0        4.47 
## 4 ADH1B  1277.       755.     1858.    1776.     2383.      1522.     984.   
## 5 ADH1C     8.80       4.07     11.9     13.9      15.5        6.11     1.79 
## 6 ALB       4.00       1.02      2.38     0.816     0.861      2.62     0.895
## # ... with 26 more variables: healthy_8 <dbl>, healthy_9 <dbl>,
## #   healthy_10 <dbl>, healthy_11 <dbl>, healthy_12 <dbl>, healthy_13 <dbl>,
## #   healthy_14 <dbl>, healthy_15 <dbl>, leiomyoma_1 <dbl>, leiomyoma_2 <dbl>,
## #   leiomyoma_3 <dbl>, leiomyoma_4 <dbl>, leiomyoma_5 <dbl>, leiomyoma_6 <dbl>,
## #   leiomyoma_7 <dbl>, leiomyoma_8 <dbl>, leiomyoma_9 <dbl>,
## #   leiomyoma_10 <dbl>, leiomyoma_11 <dbl>, leiomyoma_12 <dbl>,
## #   leiomyoma_13 <dbl>, leiomyoma_14 <dbl>, leiomyoma_15 <dbl>,
## #   healthyMean <dbl>, leiomyomaMean <dbl>, LEIO_Healthy_foldChange <dbl>
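
As a worked check of the fold change rule, the sketch below wraps the same piecewise logic in a hypothetical helper, `fold_change`, and applies it to the ADH1B values printed in the head() output earlier. The edge-case inputs at the end are made up to exercise the three zero-handling branches.

```r
# Same piecewise rule as the nested ifelse() above, as a hypothetical helper
fold_change <- function(leio, healthy) {
  ratio <- leio / healthy
  ifelse(is.infinite(ratio), 1 + leio,      # healthy mean is 0, leiomyoma mean is not
  ifelse(is.nan(ratio), 1,                  # both means are 0
  ifelse(ratio <= 0, 1 - healthy, ratio)))  # leiomyoma mean is 0, healthy mean is not
}

# ADH1B values copied from the head() output above
adh1b_healthy <- c(1277.4497950, 754.574557, 1858.193775, 1776.381964,
                   2382.777250, 1522.354540, 984.1406688, 2755.331511,
                   1055.640258, 492.581641, 668.6490576, 560.973874,
                   1040.418359, 1135.6453150, 2348.998172)
adh1b_leio <- c(83.141726, 110.829391, 158.384796, 255.27180, 505.1844544,
                369.508404, 205.4191119, 47.447882, 79.514967, 177.911995,
                26.16484, 7.999140, 53.58402, 11.4162798, 98.385673)

adh1b_fc <- fold_change(mean(adh1b_leio), mean(adh1b_healthy))
print(adh1b_fc)

# Made-up edge cases: 0/0, x/0, and 0/x
edge <- fold_change(leio = c(0, 3, 0), healthy = c(0, 0, 5))
print(edge)
```

For ADH1B the result is well below 1, meaning lower mean expression in leiomyoma than in matched myometrium. Note that the `1 - healthy` branch returns a number that is not a ratio, so genes that hit it are not on the same scale as the rest.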

# merge() defaults to an inner join, so only genes present in both inputs are kept
DF6 <- merge(bs3,DF4,by='gene')

DF7 <- merge(DF6,DF5,by='gene')

colnames(DF7)

##  [1] "gene"                    "proteinSearched"        
##  [3] "EntrezSummary"           "GeneCardsSummary"       
##  [5] "UniProtKB_Summary"       "EnsemblID"              
##  [7] "EntrezID"                "geneCounts_GSE128242"   
##  [9] "healthy_1"               "healthy_2"              
## [11] "healthy_3"               "healthy_4"              
## [13] "healthy_5"               "healthy_6"              
## [15] "healthy_7"               "healthy_8"              
## [17] "healthy_9"               "healthy_10"             
## [19] "healthy_11"              "healthy_12"             
## [21] "healthy_13"              "healthy_14"             
## [23] "healthy_15"              "leiomyoma_1"            
## [25] "leiomyoma_2"             "leiomyoma_3"            
## [27] "leiomyoma_4"             "leiomyoma_5"            
## [29] "leiomyoma_6"             "leiomyoma_7"            
## [31] "leiomyoma_8"             "leiomyoma_9"            
## [33] "leiomyoma_10"            "leiomyoma_11"           
## [35] "leiomyoma_12"            "leiomyoma_13"           
## [37] "leiomyoma_14"            "leiomyoma_15"           
## [39] "healthyMean"             "leiomyomaMean"          
## [41] "LEIO_Healthy_foldChange"
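
Because merge() performs an inner join by default, genes missing from either input drop out silently at each step. The toy sketch below shows the effect; the data frames `annot`, `counts`, and `means` are made-up stand-ins, not the real bs3, DF4, and DF5.

```r
# Toy stand-ins (made-up values): annotations, row counts, and per-gene means
annot  <- data.frame(gene = c("A", "B", "C"), EntrezID = c(1, 2, 3),
                     stringsAsFactors = FALSE)
counts <- data.frame(gene = c("A", "B"), geneCounts = c(1, 2),
                     stringsAsFactors = FALSE)
means  <- data.frame(gene = c("B", "C", "D"), healthyMean = c(5, 7, 9),
                     stringsAsFactors = FALSE)

# Default merge() is an inner join: only genes found in both inputs survive
step1 <- merge(annot, counts, by = "gene")   # keeps A and B
step2 <- merge(step1, means, by = "gene")    # keeps B only

# Genes from the annotation table that were lost along the way
dropped <- setdiff(annot$gene, step2$gene)
print(step2)
print(dropped)
```

Comparing `nrow()` before and after each merge, or passing `all.x = TRUE` to keep every row of the left table, is one way to make such losses visible.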

head(DF7)

##    gene      proteinSearched
## 1 AANAT            melatonin
## 2 ACSL4               col4a5
## 3 ACTA1                fiber
## 4 ADH1B              alcohol
## 5 ADH1C              alcohol
## 6   ALB green-coffee-extract
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 EntrezSummary
## 1                                                                 The protein encoded by this gene belongs to the acetyltransferase superfamily. It is the penultimate enzyme in melatonin synthesis and controls the night/day rhythm in melatonin production in the vertebrate pineal gland. Melatonin is essential for the function of the circadian clock that influences activity and sleep. This enzyme is regulated by cAMP-dependent phosphorylation that promotes its interaction with 14-3-3 proteins and thus protects the enzyme against proteasomal degradation. This gene may contribute to numerous genetic diseases such as delayed sleep phase syndrome. Alternatively spliced transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Oct 2009]
## 2                                                                                                                                                          The protein encoded by this gene is an isozyme of the long-chain fatty-acid-coenzyme A ligase family. Although differing in substrate specificity, subcellular localization, and tissue distribution, all isozymes of this family convert free long-chain fatty acids into fatty acyl-CoA esters, and thereby play a key role in lipid biosynthesis and fatty acid degradation. This isozyme preferentially utilizes arachidonate as substrate. The absence of this enzyme may contribute to the cognitive disability or Alport syndrome. Alternative splicing of this gene generates multiple transcript variants. [provided by RefSeq, Jan 2016]
## 3 The product encoded by this gene belongs to the actin family of proteins, which are highly conserved proteins that play a role in cell motility, structure and integrity. Alpha, beta and gamma actin isoforms have been identified, with alpha actins being a major constituent of the contractile apparatus, while beta and gamma actins are involved in the regulation of cell motility. This actin is an alpha actin that is found in skeletal muscle. Mutations in this gene cause a variety of myopathies, including nemaline myopathy, congenital myopathy with excess of thin myofilaments, congenital myopathy with cores, and congenital myopathy with fiber-type disproportion, diseases that lead to muscle fiber defects with manifestations such as hypotonia. [provided by RefSeq, Sep 2019]
## 4                                                                                                     The protein encoded by this gene is a member of the alcohol dehydrogenase family. Members of this enzyme family metabolize a wide variety of substrates, including ethanol, retinol, other aliphatic alcohols, hydroxysteroids, and lipid peroxidation products. This encoded protein, consisting of several homo- and heterodimers of alpha, beta, and gamma subunits, exhibits high activity for ethanol oxidation and plays a major role in ethanol catabolism. Three genes encoding alpha, beta and gamma subunits are tandemly organized in a genomic segment as a gene cluster. Two transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Nov 2013]
## 5                           This gene encodes class I alcohol dehydrogenase, gamma subunit, which is a member of the alcohol dehydrogenase family. Members of this enzyme family metabolize a wide variety of substrates, including ethanol, retinol, other aliphatic alcohols, hydroxysteroids, and lipid peroxidation products. Class I alcohol dehydrogenase, consisting of several homo- and heterodimers of alpha, beta, and gamma subunits, exhibits high activity for ethanol oxidation to acetaldehyde, thus playing a major role in ethanol catabolism. Three genes encoding alpha, beta and gamma subunits are tandemly organized in a genomic segment as a gene cluster. An association between ADH1C polymorphism and alcohol dependence has not been established. [provided by RefSeq, Sep 2019]
## 6                                                                                                                                                                           This gene encodes the most abundant protein in human blood. This protein functions in the regulation of blood plasma colloid osmotic pressure and acts as a carrier protein for a wide range of endogenous molecules including hormones, fatty acids, and metabolites, as well as exogenous drugs. Additionally, this protein exhibits an esterase-like activity with broad substrate specificity. The encoded preproprotein is proteolytically processed to generate the mature protein. A peptide derived from this protein, EPI-X4, is an endogenous inhibitor of the CXCR4 chemokine receptor. [provided by RefSeq, Jul 2016]
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      GeneCardsSummary
## 1                                                                                                                                                                                                                          AANAT (Aralkylamine N-Acetyltransferase) is a Protein Coding gene.                                            Diseases associated with AANAT include Dissociative Amnesia and Baastrup's Syndrome.                                            Among its related pathways are superpathway of tryptophan utilization and Tryptophan metabolism.                                            Gene Ontology (GO) annotations related to this gene include N-acetyltransferase activity and arylamine N-acetyltransferase activity.                                            
## 2                                             ACSL4 (Acyl-CoA Synthetase Long Chain Family Member 4) is a Protein Coding gene.                                            Diseases associated with ACSL4 include Non-Syndromic X-Linked Intellectual Disability and Stroke, Ischemic.                                            Among its related pathways are Respiratory electron transport, ATP synthesis by chemiosmotic coupling, and heat production by uncoupling proteins. and Fatty acid biosynthesis (KEGG).                                            Gene Ontology (GO) annotations related to this gene include long-chain fatty acid-CoA ligase activity and arachidonate-CoA ligase activity.                                            An important paralog of this gene is ACSL3.
## 3                                                                                                                                  ACTA1 (Actin Alpha 1, Skeletal Muscle) is a Protein Coding gene.                                            Diseases associated with ACTA1 include Myopathy, Scapulohumeroperoneal and Nemaline Myopathy 3.                                            Among its related pathways are Association Between Physico-Chemical Features and Toxicity Associated Pathways and Development Slit-Robo signaling.                                            Gene Ontology (GO) annotations related to this gene include structural constituent of cytoskeleton and myosin binding.                                            An important paralog of this gene is ACTC1.
## 4                                                                                                                                                        ADH1B (Alcohol Dehydrogenase 1B (Class I), Beta Polypeptide) is a Protein Coding gene.                                            Diseases associated with ADH1B include Alcohol Dependence and Fetal Alcohol Syndrome.                                            Among its related pathways are Glucose metabolism and acetone degradation I (to methylglyoxal).                                            Gene Ontology (GO) annotations related to this gene include oxidoreductase activity and alcohol dehydrogenase activity, zinc-dependent.                                            An important paralog of this gene is ADH1C.
## 5                                                                                                                                                                                 ADH1C (Alcohol Dehydrogenase 1C (Class I), Gamma Polypeptide) is a Protein Coding gene.                                            Diseases associated with ADH1C include Alcohol Dependence and Parkinson Disease, Late-Onset.                                            Among its related pathways are Glucose metabolism and Signaling by GPCR.                                            Gene Ontology (GO) annotations related to this gene include oxidoreductase activity and alcohol dehydrogenase (NAD) activity.                                            An important paralog of this gene is ADH1B.
## 6                                                                                                                                                                                                                                                     ALB (Albumin) is a Protein Coding gene.                                            Diseases associated with ALB include Analbuminemia and Hyperthyroxinemia, Familial Dysalbuminemic.                                            Among its related pathways are Lipoprotein metabolism and Folate Metabolism.                                            Gene Ontology (GO) annotations related to this gene include enzyme binding and chaperone binding.                                            An important paralog of this gene is AFP.
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      UniProtKB_Summary
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               Controls the night/day rhythm of melatonin production in the pineal gland. Catalyzes the N-acetylation of serotonin into N-acetylserotonin, the penultimate step in the synthesis of melatonin.\n                         SNAT_HUMAN,Q16613\n                         
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       Catalyzes the conversion of long-chain fatty acids to their active form acyl-CoA for both synthesis of cellular lipids, and degradation via beta-oxidation (PubMed:24269233, PubMed:22633490, PubMed:21242590). Preferentially activates arachidonate and eicosapentaenoate as substrates (PubMed:21242590). Preferentially activates 8,9-EET > 14,15-EET > 5,6-EET > 11,12-EET. Modulates glucose-stimulated insulin secretion by regulating the levels of unesterified EETs (By similarity). Modulates prostaglandin E2 secretion (PubMed:21242590).\n                         ACSL4_HUMAN,O60488\n                         
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Actins are highly conserved proteins that are involved in various types of cell motility and are ubiquitously expressed in all eukaryotic cells.\n                         ACTS_HUMAN,P68133\n                         
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Catalyzes the NAD-dependent oxidation of all-trans-retinol and its derivatives such as all-trans-4-hydroxyretinol and may participate in retinoid metabolism (PubMed:15369820, PubMed:16787387). In vitro can also catalyzes the NADH-dependent reduction of all-trans-retinal and its derivatives such as all-trans-4-oxoretinal (PubMed:15369820, PubMed:16787387). Catalyzes in the oxidative direction with higher efficiency (PubMed:16787387). Has the same affinity for all-trans-4-hydroxyretinol and all-trans-4-oxoretinal (PubMed:15369820).\n                         ADH1B_HUMAN,P00325\n                         
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           no summary
## 6 Binds water, Ca(2+), Na(+), K(+), fatty acids, hormones, bilirubin and drugs (Probable). Its main function is the regulation of the colloidal osmotic pressure of blood (Probable). Major zinc transporter in plasma, typically binds about 80% of all plasma zinc (PubMed:19021548). Major calcium and magnesium transporter in plasma, binds approximately 45% of circulating calcium and magnesium in plasma (By similarity). Potentially has more than two calcium-binding sites and might additionally bind calcium in a non-specific manner (By similarity). The shared binding site between zinc and calcium at residue Asp-273 suggests a crosstalk between zinc and calcium transport in the blood (By similarity). The rank order of affinity is zinc > calcium > magnesium (By similarity). Binds to the bacterial siderophore enterobactin and inhibits enterobactin-mediated iron uptake of E.coli from ferric transferrin, and may thereby limit the utilization of iron and growth of enteric bacteria such as E.coli (PubMed:6234017). Does not prevent iron uptake by the bacterial siderophore aerobactin (PubMed:6234017).\n                         ALBU_HUMAN,P02768\n                         
##         EnsemblID EntrezID geneCounts_GSE128242    healthy_1  healthy_2
## 1 ENSG00000129673       15                    1    0.8004071   1.016947
## 2 ENSG00000068366     2182                    1 1307.0648590 759.659291
## 3 ENSG00000143632       58                    1    1.6008143   5.084734
## 4 ENSG00000196616      125                    1 1277.4497950 754.574557
## 5 ENSG00000248144      126                    1    8.8044785   4.067787
## 6 ENSG00000163631      213                    4    4.0020357   1.016947
##     healthy_3   healthy_4   healthy_5   healthy_6   healthy_7   healthy_8
## 1    0.000000    0.815977    1.722282    0.000000   0.0000000    0.000000
## 2  745.811051  682.972762  994.617898  971.548511 840.9929351  910.819020
## 3    3.166926    5.711839    0.861141    0.000000   4.4733667   13.516721
## 4 1858.193775 1776.381964 2382.777250 1522.354540 984.1406688 2755.331511
## 5   11.875972   13.871609   15.500539    6.110368   1.7893467   13.516721
## 6    2.375194    0.815977    0.861141    2.618729   0.8946733    1.039748
##     healthy_9  healthy_10  healthy_11 healthy_12  healthy_13   healthy_14
## 1    0.000000    0.000000   0.9172141   0.000000    1.100972    0.8649241
## 2  803.749860 1309.894163 725.5163300 366.144624  868.666757 1045.6932110
## 3    6.271130   10.970638   0.9172141   5.038687    8.807774    0.0000000
## 4 1055.640258  492.581641 668.6490576 560.973874 1040.418359 1135.6453150
## 5    6.271130    3.291191   9.1721407  16.795625    5.504859   25.9477224
## 6    2.090377    0.000000   5.5032844   1.679562    1.100972    0.0000000
##    healthy_15 leiomyoma_1 leiomyoma_2 leiomyoma_3 leiomyoma_4 leiomyoma_5
## 1    0.000000    0.000000    0.000000    3.493782     0.00000   0.9604267
## 2  741.454644  574.346865  500.257435  330.744720   218.47587 385.1311145
## 3    7.144638    1.911304    3.050350    0.000000     0.00000   4.8021336
## 4 2348.998172   83.141726  110.829391  158.384796   255.27180 505.1844544
## 5   35.723189    0.000000    1.016783    6.987565    10.73215   8.6438405
## 6    1.587697    0.955652    0.000000    1.164594     0.00000   0.9604267
##   leiomyoma_6 leiomyoma_7 leiomyoma_8 leiomyoma_9 leiomyoma_10 leiomyoma_11
## 1    1.789387   0.0000000    0.895243    1.135928     1.261787      1.00634
## 2  518.027520 381.7523318  461.945413  152.214365   311.661438    402.53602
## 3    1.789387   7.2714730   25.962048    2.271856    11.356085      1.00634
## 4  369.508404 205.4191119   47.447882   79.514967   177.911995     26.16484
## 5    8.946935   4.5446706    3.580972    7.951497     6.308936      1.00634
## 6    0.000000   0.9089341    0.000000    0.000000     0.000000      2.01268
##   leiomyoma_12 leiomyoma_13 leiomyoma_14 leiomyoma_15  healthyMean
## 1     1.142734      2.32974    0.9513567     0.000000    0.4825815
## 2   573.652585    437.99112  506.1217393   355.888915  871.6403945
## 3     7.999140      3.49461    0.9513567     0.000000    4.9043749
## 4     7.999140     53.58402   11.4162798    98.385673 1374.2740491
## 5     1.142734      0.00000    0.0000000     4.858552   11.8828453
## 6     2.285468      0.00000    0.0000000     1.214638    1.7057559
##   leiomyomaMean LEIO_Healthy_foldChange
## 1     0.9977817               2.0675919
## 2   407.3831633               0.4673753
## 3     4.7910722               0.9768976
## 4   146.0109654               0.1062459
## 5     4.3813982               0.3687163
## 6     0.6334929               0.3713854

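# Reshape the 30 sample columns (positions 9:38) into long format.
DF8 <- gather(DF7,key='sample',value='sampleValue',9:38)
# Label each row by tissue type, based on its sample name.
ul <- grep('leio',DF8$sample)
healthy <- grep('healthy',DF8$sample)
DF8$group <- 'group'
DF8$group[ul] <- 'leiomyoma'
DF8$group[healthy] <- 'healthy myo'
unique(DF8$group)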

## [1] "healthy myo" "leiomyoma"

head(DF8)

##    gene      proteinSearched
## 1 AANAT            melatonin
## 2 ACSL4               col4a5
## 3 ACTA1                fiber
## 4 ADH1B              alcohol
## 5 ADH1C              alcohol
## 6   ALB green-coffee-extract
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 EntrezSummary
## 1                                                                 The protein encoded by this gene belongs to the acetyltransferase superfamily. It is the penultimate enzyme in melatonin synthesis and controls the night/day rhythm in melatonin production in the vertebrate pineal gland. Melatonin is essential for the function of the circadian clock that influences activity and sleep. This enzyme is regulated by cAMP-dependent phosphorylation that promotes its interaction with 14-3-3 proteins and thus protects the enzyme against proteasomal degradation. This gene may contribute to numerous genetic diseases such as delayed sleep phase syndrome. Alternatively spliced transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Oct 2009]
## 2                                                                                                                                                          The protein encoded by this gene is an isozyme of the long-chain fatty-acid-coenzyme A ligase family. Although differing in substrate specificity, subcellular localization, and tissue distribution, all isozymes of this family convert free long-chain fatty acids into fatty acyl-CoA esters, and thereby play a key role in lipid biosynthesis and fatty acid degradation. This isozyme preferentially utilizes arachidonate as substrate. The absence of this enzyme may contribute to the cognitive disability or Alport syndrome. Alternative splicing of this gene generates multiple transcript variants. [provided by RefSeq, Jan 2016]
## 3 The product encoded by this gene belongs to the actin family of proteins, which are highly conserved proteins that play a role in cell motility, structure and integrity. Alpha, beta and gamma actin isoforms have been identified, with alpha actins being a major constituent of the contractile apparatus, while beta and gamma actins are involved in the regulation of cell motility. This actin is an alpha actin that is found in skeletal muscle. Mutations in this gene cause a variety of myopathies, including nemaline myopathy, congenital myopathy with excess of thin myofilaments, congenital myopathy with cores, and congenital myopathy with fiber-type disproportion, diseases that lead to muscle fiber defects with manifestations such as hypotonia. [provided by RefSeq, Sep 2019]
## 4                                                                                                     The protein encoded by this gene is a member of the alcohol dehydrogenase family. Members of this enzyme family metabolize a wide variety of substrates, including ethanol, retinol, other aliphatic alcohols, hydroxysteroids, and lipid peroxidation products. This encoded protein, consisting of several homo- and heterodimers of alpha, beta, and gamma subunits, exhibits high activity for ethanol oxidation and plays a major role in ethanol catabolism. Three genes encoding alpha, beta and gamma subunits are tandemly organized in a genomic segment as a gene cluster. Two transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Nov 2013]
## 5                           This gene encodes class I alcohol dehydrogenase, gamma subunit, which is a member of the alcohol dehydrogenase family. Members of this enzyme family metabolize a wide variety of substrates, including ethanol, retinol, other aliphatic alcohols, hydroxysteroids, and lipid peroxidation products. Class I alcohol dehydrogenase, consisting of several homo- and heterodimers of alpha, beta, and gamma subunits, exhibits high activity for ethanol oxidation to acetaldehyde, thus playing a major role in ethanol catabolism. Three genes encoding alpha, beta and gamma subunits are tandemly organized in a genomic segment as a gene cluster. An association between ADH1C polymorphism and alcohol dependence has not been established. [provided by RefSeq, Sep 2019]
## 6                                                                                                                                                                           This gene encodes the most abundant protein in human blood. This protein functions in the regulation of blood plasma colloid osmotic pressure and acts as a carrier protein for a wide range of endogenous molecules including hormones, fatty acids, and metabolites, as well as exogenous drugs. Additionally, this protein exhibits an esterase-like activity with broad substrate specificity. The encoded preproprotein is proteolytically processed to generate the mature protein. A peptide derived from this protein, EPI-X4, is an endogenous inhibitor of the CXCR4 chemokine receptor. [provided by RefSeq, Jul 2016]
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      GeneCardsSummary
## 1                                                                                                                                                                                                                          AANAT (Aralkylamine N-Acetyltransferase) is a Protein Coding gene.                                            Diseases associated with AANAT include Dissociative Amnesia and Baastrup's Syndrome.                                            Among its related pathways are superpathway of tryptophan utilization and Tryptophan metabolism.                                            Gene Ontology (GO) annotations related to this gene include N-acetyltransferase activity and arylamine N-acetyltransferase activity.                                            
## 2                                             ACSL4 (Acyl-CoA Synthetase Long Chain Family Member 4) is a Protein Coding gene.                                            Diseases associated with ACSL4 include Non-Syndromic X-Linked Intellectual Disability and Stroke, Ischemic.                                            Among its related pathways are Respiratory electron transport, ATP synthesis by chemiosmotic coupling, and heat production by uncoupling proteins. and Fatty acid biosynthesis (KEGG).                                            Gene Ontology (GO) annotations related to this gene include long-chain fatty acid-CoA ligase activity and arachidonate-CoA ligase activity.                                            An important paralog of this gene is ACSL3.
## 3                                                                                                                                  ACTA1 (Actin Alpha 1, Skeletal Muscle) is a Protein Coding gene.                                            Diseases associated with ACTA1 include Myopathy, Scapulohumeroperoneal and Nemaline Myopathy 3.                                            Among its related pathways are Association Between Physico-Chemical Features and Toxicity Associated Pathways and Development Slit-Robo signaling.                                            Gene Ontology (GO) annotations related to this gene include structural constituent of cytoskeleton and myosin binding.                                            An important paralog of this gene is ACTC1.
## 4                                                                                                                                                        ADH1B (Alcohol Dehydrogenase 1B (Class I), Beta Polypeptide) is a Protein Coding gene.                                            Diseases associated with ADH1B include Alcohol Dependence and Fetal Alcohol Syndrome.                                            Among its related pathways are Glucose metabolism and acetone degradation I (to methylglyoxal).                                            Gene Ontology (GO) annotations related to this gene include oxidoreductase activity and alcohol dehydrogenase activity, zinc-dependent.                                            An important paralog of this gene is ADH1C.
## 5                                                                                                                                                                                 ADH1C (Alcohol Dehydrogenase 1C (Class I), Gamma Polypeptide) is a Protein Coding gene.                                            Diseases associated with ADH1C include Alcohol Dependence and Parkinson Disease, Late-Onset.                                            Among its related pathways are Glucose metabolism and Signaling by GPCR.                                            Gene Ontology (GO) annotations related to this gene include oxidoreductase activity and alcohol dehydrogenase (NAD) activity.                                            An important paralog of this gene is ADH1B.
## 6                                                                                                                                                                                                                                                     ALB (Albumin) is a Protein Coding gene.                                            Diseases associated with ALB include Analbuminemia and Hyperthyroxinemia, Familial Dysalbuminemic.                                            Among its related pathways are Lipoprotein metabolism and Folate Metabolism.                                            Gene Ontology (GO) annotations related to this gene include enzyme binding and chaperone binding.                                            An important paralog of this gene is AFP.
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      UniProtKB_Summary
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               Controls the night/day rhythm of melatonin production in the pineal gland. Catalyzes the N-acetylation of serotonin into N-acetylserotonin, the penultimate step in the synthesis of melatonin.\n                         SNAT_HUMAN,Q16613\n                         
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       Catalyzes the conversion of long-chain fatty acids to their active form acyl-CoA for both synthesis of cellular lipids, and degradation via beta-oxidation (PubMed:24269233, PubMed:22633490, PubMed:21242590). Preferentially activates arachidonate and eicosapentaenoate as substrates (PubMed:21242590). Preferentially activates 8,9-EET > 14,15-EET > 5,6-EET > 11,12-EET. Modulates glucose-stimulated insulin secretion by regulating the levels of unesterified EETs (By similarity). Modulates prostaglandin E2 secretion (PubMed:21242590).\n                         ACSL4_HUMAN,O60488\n                         
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Actins are highly conserved proteins that are involved in various types of cell motility and are ubiquitously expressed in all eukaryotic cells.\n                         ACTS_HUMAN,P68133\n                         
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Catalyzes the NAD-dependent oxidation of all-trans-retinol and its derivatives such as all-trans-4-hydroxyretinol and may participate in retinoid metabolism (PubMed:15369820, PubMed:16787387). In vitro can also catalyzes the NADH-dependent reduction of all-trans-retinal and its derivatives such as all-trans-4-oxoretinal (PubMed:15369820, PubMed:16787387). Catalyzes in the oxidative direction with higher efficiency (PubMed:16787387). Has the same affinity for all-trans-4-hydroxyretinol and all-trans-4-oxoretinal (PubMed:15369820).\n                         ADH1B_HUMAN,P00325\n                         
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           no summary
## 6 Binds water, Ca(2+), Na(+), K(+), fatty acids, hormones, bilirubin and drugs (Probable). Its main function is the regulation of the colloidal osmotic pressure of blood (Probable). Major zinc transporter in plasma, typically binds about 80% of all plasma zinc (PubMed:19021548). Major calcium and magnesium transporter in plasma, binds approximately 45% of circulating calcium and magnesium in plasma (By similarity). Potentially has more than two calcium-binding sites and might additionally bind calcium in a non-specific manner (By similarity). The shared binding site between zinc and calcium at residue Asp-273 suggests a crosstalk between zinc and calcium transport in the blood (By similarity). The rank order of affinity is zinc > calcium > magnesium (By similarity). Binds to the bacterial siderophore enterobactin and inhibits enterobactin-mediated iron uptake of E.coli from ferric transferrin, and may thereby limit the utilization of iron and growth of enteric bacteria such as E.coli (PubMed:6234017). Does not prevent iron uptake by the bacterial siderophore aerobactin (PubMed:6234017).\n                         ALBU_HUMAN,P02768\n                         
##         EnsemblID EntrezID geneCounts_GSE128242  healthyMean leiomyomaMean
## 1 ENSG00000129673       15                    1    0.4825815     0.9977817
## 2 ENSG00000068366     2182                    1  871.6403945   407.3831633
## 3 ENSG00000143632       58                    1    4.9043749     4.7910722
## 4 ENSG00000196616      125                    1 1374.2740491   146.0109654
## 5 ENSG00000248144      126                    1   11.8828453     4.3813982
## 6 ENSG00000163631      213                    4    1.7057559     0.6334929
##   LEIO_Healthy_foldChange    sample  sampleValue       group
## 1               2.0675919 healthy_1    0.8004071 healthy myo
## 2               0.4673753 healthy_1 1307.0648590 healthy myo
## 3               0.9768976 healthy_1    1.6008143 healthy myo
## 4               0.1062459 healthy_1 1277.4497950 healthy myo
## 5               0.3687163 healthy_1    8.8044785 healthy myo
## 6               0.3713854 healthy_1    4.0020357 healthy myo

write.csv(DF8,'UL_FCs_systemGenes.csv',row.names=F)

It would be great to combine these genes and the uterine leiomyoma (UL) data of this study with another UL study, the COVID-19 study, the Rheumatoid Arthritis study, and the Ulcerative Colitis and Crohn’s disease study. We will do this and then make an interactive Tableau dashboard that allows selecting the body system, gene, and disease to see the fold change, the mean values per group (healthy versus diseased or treated), and the individual sample values within each disease study's groups. Some column names will have to be changed so that, instead of a merge, we row bind the needed fields together and drop unneeded ones, like the gene counts per gene of a particular study, which deal with copy number variants we didn’t show. Or we can add the gene counts to those studies that didn’t have them. The body system genes we now have should be added to the original data of those studies as we bring the larger data of each study into this analysis.
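As a rough illustration of the row-bind idea, here is a minimal sketch using two toy tables. The `RA_Mean` and `RA_Healthy_foldChange` column names, the shared names `diseaseMean` and `foldChange`, and all values are assumptions made up for the example; the real study files may be laid out differently.

```r
# Toy stand-ins for two study tables; names and values are hypothetical.
ul <- data.frame(gene = c("MED12", "COL4A5"),
                 healthyMean = c(10, 4),
                 leiomyomaMean = c(5, 8),
                 LEIO_Healthy_foldChange = c(0.5, 2),
                 geneCounts_GSE128242 = c(1, 1))
ra <- data.frame(gene = c("MED12", "ALB"),
                 healthyMean = c(12, 3),
                 RA_Mean = c(6, 9),
                 RA_Healthy_foldChange = c(0.5, 3))

# Keep only the shared fields, rename study-specific columns to common
# names, and tag each row with its study. The gene count column is dropped.
ul_std <- data.frame(study = "UL",
                     gene = ul$gene,
                     healthyMean = ul$healthyMean,
                     diseaseMean = ul$leiomyomaMean,
                     foldChange = ul$LEIO_Healthy_foldChange)
ra_std <- data.frame(study = "RA",
                     gene = ra$gene,
                     healthyMean = ra$healthyMean,
                     diseaseMean = ra$RA_Mean,
                     foldChange = ra$RA_Healthy_foldChange)

# Stack the harmonised tables by row instead of merging them side by side.
combined <- rbind(ul_std, ra_std)
print(combined)
```

Stacking by row keeps one record per gene per study, which is the long shape a dashboard filter on study and gene would likely need.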

I emptied out the environment and want to read in all of the data we want to combine. The fold change columns vary from study to study, for those studies that have fold change values added to the data at all. Also, the Lyme disease study has log2 normalized data that is probably quantile normalized, hence the negative values, but I want positive values, so I am going to raise 2 to the power of every value in that data. This reverses the log2, which was done after normalization, so the values become positive and closer to the raw values. Some of the data also has the copy number variants of genes as that disease study's gene counts. Let's read in all our data first, then look at the differences.

UL_study1 <- read.csv('UL_studyGSE120854_fCs.csv',
                      sep=',',
                      header=TRUE, na.strings=c('',' ','NA'),
                      stringsAsFactors=TRUE)
UL_study2 <- read.csv('UL_FCs_systemGenes.csv',
                      sep=',',
                      header=TRUE, na.strings=c('',' ','NA'),
                      stringsAsFactors=TRUE)
UC_CD <- read.csv('allRawDataCombinedGSE135223_UC_CD.csv',
                  sep=',',
                  header=TRUE, na.strings=c('',' ','NA'),
                  stringsAsFactors=TRUE)
LymeDisease <- read.csv('LymeDisease_log2norm.csv',
                        sep=',',
                        header=TRUE, na.strings=c('',' ','NA'),
                        stringsAsFactors=TRUE)
RA <- read.csv('RA_data_fcs.csv',
               sep=',',
               header=TRUE, na.strings=c('',' ','NA'),
               stringsAsFactors=TRUE)
covid19 <- read.csv('DATA_FCs_GSE152418_covid19.csv',
                    sep=',',
                    header=TRUE, na.strings=c('',' ','NA'),
                    stringsAsFactors=TRUE)

The RA and both UL studies have gene counts. The COVID-19 data doesn’t have a gene counts column, because each gene had only one version of itself. This was verified by going back to the script that made that file (v3 of the original), grouping by ENSEMBLID to get the count, and seeing that the only count value was 1 and that the number of distinct genes equaled the total number of rows. The Lyme disease data hasn’t had any aggregate sums done to it, because every value still needs to be converted by raising 2 to its power to reverse the log2 and make all values positive. The Ulcerative Colitis and Crohn’s Disease data doesn’t have gene counts either, and no aggregations were done to it, so we can get the gene counts of the UC_CD, LymeDisease, and covid19 data sets in this script.
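The counting and uniqueness check described here can be sketched on a toy table. This version uses base R `table()` as a stand-in for the `group_by()` and `count()` calls so it runs without extra packages, and the data frame `toy` is invented for illustration:

```r
# Toy table: one row per gene version; 'ENSG02' appears twice
toy <- data.frame(ENSEMBLID = c('ENSG01', 'ENSG02', 'ENSG02', 'ENSG03'),
                  sample_1 = c(10, 4, 6, 0))

# Count rows per ID (base R counterpart of group_by() %>% count())
counts <- as.data.frame(table(toy$ENSEMBLID), stringsAsFactors = FALSE)
colnames(counts) <- c('ENSEMBLID', 'geneCount')

# If every count is 1 and there are as many IDs as rows, no gene is duplicated
allUnique <- all(counts$geneCount == 1) && nrow(counts) == nrow(toy)
print(allUnique)

# Attach the count to every row of the original data
toyCounted <- merge(counts, toy, by = 'ENSEMBLID')
print(toyCounted)
```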

cv19 <- covid19 %>% group_by(ENSEMBLID) %>% count(ENSEMBLID)
colnames(cv19)[2] <- 'cv19_geneCount'

covid19b <- merge(cv19,covid19,by.x='ENSEMBLID',by.y='ENSEMBLID')

uccd <- UC_CD %>% group_by(Ensembl_ID) %>% count(Ensembl_ID)
colnames(uccd)[2] <- 'UC_CD_geneCount'

UC_CDb <- merge(uccd,UC_CD,by.x='Ensembl_ID',by.y='Ensembl_ID')

ld <- LymeDisease %>% group_by(Gene) %>% count(Gene)

## Warning: Factor `Gene` contains implicit NA, consider using
## `forcats::fct_explicit_na`

## Warning: Factor `Gene` contains implicit NA, consider using
## `forcats::fct_explicit_na`

## Warning: Factor `Gene` contains implicit NA, consider using
## `forcats::fct_explicit_na`

colnames(ld)[2] <- 'Lyme_geneCount'

LymeDisease2 <- merge(ld,LymeDisease,by.x='Gene',by.y='Gene')

LymeDisease3 <- LymeDisease2[complete.cases(LymeDisease2$Gene),]

LymeDisease4 <- LymeDisease3[,-c(1:2)]
LymeDisease4 <- 2^(LymeDisease4) #inverse log2 on data
LymeDisease5 <- cbind(LymeDisease3[,1:2],LymeDisease4)

#the GSM ID header isn't useful, we'll read in the header data for this study
metaLyme <- read.csv('descriptors2.csv')
metaLyme$Sample_Title <- as.character(paste(metaLyme$Sample_Title))

metaLyme$Sample_Title <- gsub('PBMC total RNA-Healthy control ',
                              'healthy_Lyme_',metaLyme$Sample_Title)
metaLyme$Sample_Title <- gsub('PBMC total RNA-Acute Lyme disease subject ',
                              'acuteLyme_',metaLyme$Sample_Title)
metaLyme$Sample_Title <- gsub('PBMC total RNA-early convalescent Lyme disease subject ',
                              'Lyme_anti1month_',metaLyme$Sample_Title)
metaLyme$Sample_Title <- gsub('PBMC total RNA-late convalescent Lyme disease subject ',
                              'Lyme_anti6months_',metaLyme$Sample_Title)

colnames(LymeDisease5)[3:88]<-metaLyme$Sample_Title

colnames(covid19b)[2] <- 'geneCount'
colnames(LymeDisease5)[1:2] <- c('gene','geneCount')
colnames(RA)[2] <- 'geneCount'
colnames(UC_CDb)[2] <- 'geneCount'
colnames(UL_study1)[2] <- 'geneCount'
colnames(UL_study2)[8] <- 'geneCount'

RA$group <- 'group'
healthy <- grep('healthy',RA$sample)
treatment <- grep('treatment',RA$sample)
RA[healthy,8] <- 'healthy RA'
RA[treatment,8] <- 'treatment RA abatacept'

RA2 <- gather(RA,key='groupMean',value='groupMeanValue',3:4)
RA3 <- gather(RA2,key='foldChangeGroup',value='foldChangeGroupValue',3)

covid19b2 <- gather(covid19b,key='sample',value='sampleValue',3:36)
covid19b2$group <- 'group'
conv <- grep('convalescent',covid19b2$sample)
healthy <- grep('healthy',covid19b2$sample)
mod <- grep('moderate',covid19b2$sample)
severe <- grep('severe',covid19b2$sample)
icu <- grep('ICU',covid19b2$sample)

covid19b2[conv,12] <- 'convalescent cv19'
covid19b2[healthy,12] <- 'healthy cv19' 
covid19b2[mod,12] <- 'moderate cv19' 
covid19b2[severe,12] <- 'severe cv19' 
covid19b2[icu,12] <- 'ICU cv19' 

cv1 <- covid19b2

cv2 <- gather(cv1,key='groupMean',value='groupMeanValue',3:6)
cv3 <- gather(cv2, key='foldChangeGroup',value='foldChangeGroupValue',3:5)

UL_study1b <- gather(UL_study1,key='groupMean',value='groupMeanValue',3:4)
UL_study1c <- gather(UL_study1b,key='foldChangeGroup',
                     value='foldChangeGroupValue',3)

UL_Study2 <- UL_study2[,-c(2:7)]
UL_study2b <- gather(UL_Study2,key='groupMean',value='groupMeanValue',3:4)
UL_study2c <- gather(UL_study2b,key='foldChangeGroup',
                     value='foldChangeGroupValue',3)

The Lyme Disease data and the Ulcerative Colitis with Crohn’s Disease data need to be grouped by gene to get the group means and the fold change of each diseased or treated group mean relative to the healthy group mean per gene. They are then gathered into the sample, mean, and fold change groups with their corresponding values.
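The gathering step reshapes a wide table (one column per sample) into a long one (one row per gene and sample). Here is a rough sketch on toy values, using base R `stack()` in place of `gather()` so it runs without extra packages. The data frame `wide` and its numbers are assumptions for illustration:

```r
# Toy wide table: one row per gene, one column per sample
wide <- data.frame(gene = c('MED12', 'HMGA2'),
                   healthy_1 = c(4, 2),
                   UL_1 = c(8, 1))

# stack() turns the sample columns into value/indicator pairs, column by column
stacked <- stack(wide[, c('healthy_1', 'UL_1')])

# Rebuild the long table; the gene column is repeated once per sample column
long <- data.frame(gene = rep(wide$gene, times = 2),
                   sample = as.character(stacked$ind),
                   sampleValue = stacked$values)

# Label each row's group from its sample name, as the script does with grep()
long$group <- ifelse(grepl('healthy', long$sample), 'healthy myo', 'leiomyoma')
print(long)
```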

LD <- LymeDisease5 %>% group_by(gene) %>% summarise_at(vars('healthy_Lyme_1':'Lyme_anti6months_10'),mean)

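#LD holds gene plus the 86 sample columns (no geneCount column), so each group's
#columns are selected by name instead of by position
LD$healthyLymeMean <- apply(LD[,grep('^healthy_Lyme_',colnames(LD))],1,mean)
LD$acuteLymeMean <- apply(LD[,grep('^acuteLyme_',colnames(LD))],1,mean)
LD$Lyme1monthMean <- apply(LD[,grep('^Lyme_anti1month_',colnames(LD))],1,mean)
LD$Lyme6monthsMean <- apply(LD[,grep('^Lyme_anti6months_',colnames(LD))],1,mean)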

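#use the same modified fold change as in the RA samples: a ratio of Inf becomes
#1 + disease mean, NaN becomes 1, and a ratio of 0 or less becomes 1 - healthy mean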
LD$acuteHealthyLymeFoldChange <- ifelse(LD$acuteLymeMean/LD$healthyLymeMean=='Inf',
  1+LD$acuteLymeMean, 
ifelse(LD$acuteLymeMean/LD$healthyLymeMean=='NaN',1,
ifelse(LD$acuteLymeMean/LD$healthyLymeMean<=0,
  1-LD$healthyLymeMean,LD$acuteLymeMean/LD$healthyLymeMean)))

LD$month1HealthyLymeFoldChange <- ifelse(LD$Lyme1monthMean/LD$healthyLymeMean=='Inf',
  1+LD$Lyme1monthMean,
ifelse(LD$Lyme1monthMean/LD$healthyLymeMean=='NaN',1,
ifelse(LD$Lyme1monthMean/LD$healthyLymeMean<=0,
  1-LD$healthyLymeMean,LD$Lyme1monthMean/LD$healthyLymeMean)))

LD$month6HealthyLymeFoldChange <- ifelse(LD$Lyme6monthsMean/LD$healthyLymeMean=='Inf',
  1+LD$Lyme6monthsMean,
ifelse(LD$Lyme6monthsMean/LD$healthyLymeMean=='NaN',1,
ifelse(LD$Lyme6monthsMean/LD$healthyLymeMean<=0,
  1-LD$healthyLymeMean,LD$Lyme6monthsMean/LD$healthyLymeMean)))
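
The nested `ifelse()` rule above can be restated as one reusable function. This is only a sketch: `safeFoldChange` is a hypothetical helper name that does not appear in the original script, and it uses `is.infinite()` and `is.nan()` rather than comparing against the strings 'Inf' and 'NaN'. The toy means are chosen to hit each branch:

```r
# Hypothetical helper that restates the nested ifelse() fold change rule
safeFoldChange <- function(diseaseMean, healthyMean) {
  ratio <- diseaseMean / healthyMean
  ifelse(is.infinite(ratio), 1 + diseaseMean,  # healthy mean is 0, disease mean is not
  ifelse(is.nan(ratio), 1,                     # both means are 0
  ifelse(ratio <= 0, 1 - healthyMean,          # disease mean is 0, healthy mean is not
         ratio)))                              # ordinary ratio
}

# Toy means covering each branch: ordinary, Inf, NaN, zero ratio
diseaseMean <- c(6, 3, 0, 0)
healthyMean <- c(2, 0, 0, 4)
foldChanges <- safeFoldChange(diseaseMean, healthyMean)
print(foldChanges)  # 3 4 1 -3
```

With a helper like this, each fold change column would be a single call, for example `safeFoldChange(LD$acuteLymeMean, LD$healthyLymeMean)`.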

LD0 <- LymeDisease5[,1:2]

LD01 <- merge(LD0,LD,by.x='gene',by.y='gene')
LD1 <- gather(LD01,key='sample',value='sampleValue',3:88)
LD1$group <- 'group'
healthy <- grep('healthy',LD1$sample)
acute <- grep('acute',LD1$sample)
month1 <- grep('anti1',LD1$sample)
month6 <- grep('anti6',LD1$sample)
LD1[healthy,12] <- 'healthy Lyme'
LD1[acute,12] <- 'acute Lyme'
LD1[month1,12] <- 'Lyme 1 month antibiotics'
LD1[month6,12] <- 'Lyme 6 months antibiotics'

LD2 <- gather(LD1, key='groupMean',value='groupMeanValue',3:6)
LD3 <- gather(LD2,key='foldChangeGroup',value='foldChangeGroupValue',3:5)

uccb <- UC_CDb %>% group_by(Ensembl_ID) %>% summarise_at(vars('Crohns.Disease.rep.1':'Ulcerative.Colitis.rep.5'),mean)

uccb2 <- UC_CDb[,1:2]
uccb2a <- merge(uccb2,uccb,by.x='Ensembl_ID', by.y='Ensembl_ID')

uccb2a$CrohnsMean <- apply(uccb2a[,3:7],1,mean)
uccb2a$healthyCrohnsMean <- apply(uccb2a[,8:12],1,mean)
uccb2a$mockCrohnsMean <- apply(uccb2a[,13:15],1,mean)
uccb2a$ulcerativeColitisMean <- apply(uccb2a[,16:20],1,mean)

#use modified FCs used in RA data
uccb2a$crohnsHealthyFoldChange <- ifelse(uccb2a$CrohnsMean/uccb2a$healthyCrohnsMean=='Inf',
  1+uccb2a$CrohnsMean,
ifelse(uccb2a$CrohnsMean/uccb2a$healthyCrohnsMean=='NaN',
  1,
ifelse(uccb2a$CrohnsMean/uccb2a$healthyCrohnsMean <= 0,
  1-uccb2a$healthyCrohnsMean, #1 - healthy mean, matching the other fold change columns
  uccb2a$CrohnsMean/uccb2a$healthyCrohnsMean)))

uccb2a$mockHealthyFoldChange <- ifelse(uccb2a$mockCrohnsMean/uccb2a$healthyCrohnsMean=='Inf',
  1+uccb2a$mockCrohnsMean,
ifelse(uccb2a$mockCrohnsMean/uccb2a$healthyCrohnsMean=='NaN',
  1,
ifelse(uccb2a$mockCrohnsMean/uccb2a$healthyCrohnsMean<=0,
  1-uccb2a$healthyCrohnsMean,
  uccb2a$mockCrohnsMean/uccb2a$healthyCrohnsMean)))

uccb2a$ulcerColitFoldChange <- ifelse(uccb2a$ulcerativeColitisMean/uccb2a$healthyCrohnsMean=='Inf',
  1+uccb2a$ulcerativeColitisMean,
ifelse(uccb2a$ulcerativeColitisMean/uccb2a$healthyCrohnsMean=='NaN',
  1,
ifelse(uccb2a$ulcerativeColitisMean/uccb2a$healthyCrohnsMean<=0,
  1-uccb2a$healthyCrohnsMean,
  uccb2a$ulcerativeColitisMean/uccb2a$healthyCrohnsMean)))

uccb2b <- gather(uccb2a,key='sample',value='sampleValue',3:20)

uccb2b$group <- 'group'

healthy <- grep('Healthy',uccb2b$sample)
crohns <- grep('Crohns',uccb2b$sample)
mock <- grep('mock',uccb2b$sample)
ulcer <- grep('Ulcer',uccb2b$sample)

uccb2b[healthy,12] <- 'healthy non-IBS'
uccb2b[crohns,12] <- 'Crohns Disease IBS'
uccb2b[mock,12] <- 'mock IBS samples'
uccb2b[ulcer,12] <- 'ulcerative colitis IBS'

uccb2c <- gather(uccb2b,key='groupMean',value='groupMeanValue',3:6)
uccb2d <- gather(uccb2c,key='foldChangeGroup',value='foldChangeGroupValue',3:5)

covid19 <- cv3
LymeDisease <- LD3
RheumatoidArthritis <- RA3
IBS <- uccb2d
UL_study1 <- UL_study1c
UL_study2 <- UL_study2c

colnames(covid19)

## [1] "ENSEMBLID"            "geneCount"            "sample"              
## [4] "sampleValue"          "group"                "groupMean"           
## [7] "groupMeanValue"       "foldChangeGroup"      "foldChangeGroupValue"

colnames(IBS)

## [1] "Ensembl_ID"           "geneCount"            "sample"              
## [4] "sampleValue"          "group"                "groupMean"           
## [7] "groupMeanValue"       "foldChangeGroup"      "foldChangeGroupValue"

colnames(LymeDisease)

## [1] "gene"                 "geneCount"            "sample"              
## [4] "sampleValue"          "group"                "groupMean"           
## [7] "groupMeanValue"       "foldChangeGroup"      "foldChangeGroupValue"

colnames(RheumatoidArthritis)

## [1] "gene"                 "geneCount"            "sample"              
## [4] "sampleValue"          "group"                "groupMean"           
## [7] "groupMeanValue"       "foldChangeGroup"      "foldChangeGroupValue"

colnames(UL_study1)

## [1] "gene"                 "geneCount"            "sample"              
## [4] "sampleValue"          "group"                "groupMean"           
## [7] "groupMeanValue"       "foldChangeGroup"      "foldChangeGroupValue"

colnames(UL_study2)

## [1] "gene"                 "geneCount"            "sample"              
## [4] "sampleValue"          "group"                "groupMean"           
## [7] "groupMeanValue"       "foldChangeGroup"      "foldChangeGroupValue"

systemGenes <- read.csv('bodySystems3.csv')

cv19 <- merge(systemGenes,covid19,by.x='EnsemblID',by.y='ENSEMBLID')
uccd <- merge(systemGenes,IBS,by.x='EnsemblID',by.y='Ensembl_ID')
LD <- merge(systemGenes,LymeDisease,by.x='gene',by.y='gene')
RA <- merge(systemGenes,RheumatoidArthritis,by.x='gene',by.y='gene')
UL1 <- merge(systemGenes,UL_study1,by.x='gene',by.y='gene')
UL2 <- merge(systemGenes,UL_study2,by.x='gene',by.y='gene')

colnames(cv19)

##  [1] "EnsemblID"            "gene"                 "proteinSearched"     
##  [4] "EntrezSummary"        "GeneCardsSummary"     "UniProtKB_Summary"   
##  [7] "EntrezID"             "geneCount"            "sample"              
## [10] "sampleValue"          "group"                "groupMean"           
## [13] "groupMeanValue"       "foldChangeGroup"      "foldChangeGroupValue"

colnames(uccd)

##  [1] "EnsemblID"            "gene"                 "proteinSearched"     
##  [4] "EntrezSummary"        "GeneCardsSummary"     "UniProtKB_Summary"   
##  [7] "EntrezID"             "geneCount"            "sample"              
## [10] "sampleValue"          "group"                "groupMean"           
## [13] "groupMeanValue"       "foldChangeGroup"      "foldChangeGroupValue"

colnames(RA)

##  [1] "gene"                 "proteinSearched"      "EntrezSummary"       
##  [4] "GeneCardsSummary"     "UniProtKB_Summary"    "EnsemblID"           
##  [7] "EntrezID"             "geneCount"            "sample"              
## [10] "sampleValue"          "group"                "groupMean"           
## [13] "groupMeanValue"       "foldChangeGroup"      "foldChangeGroupValue"

uccd1 <- uccd[,c(2:6,1,7:15)]
cv19b <- cv19[,c(2:6,1,7:15)]

colnames(uccd1)

##  [1] "gene"                 "proteinSearched"      "EntrezSummary"       
##  [4] "GeneCardsSummary"     "UniProtKB_Summary"    "EnsemblID"           
##  [7] "EntrezID"             "geneCount"            "sample"              
## [10] "sampleValue"          "group"                "groupMean"           
## [13] "groupMeanValue"       "foldChangeGroup"      "foldChangeGroupValue"

colnames(cv19b)

##  [1] "gene"                 "proteinSearched"      "EntrezSummary"       
##  [4] "GeneCardsSummary"     "UniProtKB_Summary"    "EnsemblID"           
##  [7] "EntrezID"             "geneCount"            "sample"              
## [10] "sampleValue"          "group"                "groupMean"           
## [13] "groupMeanValue"       "foldChangeGroup"      "foldChangeGroupValue"

colnames(RA)

##  [1] "gene"                 "proteinSearched"      "EntrezSummary"       
##  [4] "GeneCardsSummary"     "UniProtKB_Summary"    "EnsemblID"           
##  [7] "EntrezID"             "geneCount"            "sample"              
## [10] "sampleValue"          "group"                "groupMean"           
## [13] "groupMeanValue"       "foldChangeGroup"      "foldChangeGroupValue"

colnames(LD)

##  [1] "gene"                 "proteinSearched"      "EntrezSummary"       
##  [4] "GeneCardsSummary"     "UniProtKB_Summary"    "EnsemblID"           
##  [7] "EntrezID"             "geneCount"            "sample"              
## [10] "sampleValue"          "group"                "groupMean"           
## [13] "groupMeanValue"       "foldChangeGroup"      "foldChangeGroupValue"

colnames(UL1)

##  [1] "gene"                 "proteinSearched"      "EntrezSummary"       
##  [4] "GeneCardsSummary"     "UniProtKB_Summary"    "EnsemblID"           
##  [7] "EntrezID"             "geneCount"            "sample"              
## [10] "sampleValue"          "group"                "groupMean"           
## [13] "groupMeanValue"       "foldChangeGroup"      "foldChangeGroupValue"

colnames(UL2)

##  [1] "gene"                 "proteinSearched"      "EntrezSummary"       
##  [4] "GeneCardsSummary"     "UniProtKB_Summary"    "EnsemblID"           
##  [7] "EntrezID"             "geneCount"            "sample"              
## [10] "sampleValue"          "group"                "groupMean"           
## [13] "groupMeanValue"       "foldChangeGroup"      "foldChangeGroupValue"

sixStudiesAndSystemGenes <- rbind(uccd1,cv19b,RA,LD,UL1,UL2)

unique(sixStudiesAndSystemGenes$group)

##  [1] "Crohns Disease IBS"        "healthy non-IBS"          
##  [3] "mock IBS samples"          "ulcerative colitis IBS"   
##  [5] "severe cv19"               "healthy cv19"             
##  [7] "convalescent cv19"         "ICU cv19"                 
##  [9] "moderate cv19"             "treatment RA abatacept"   
## [11] "healthy RA"                "acute Lyme"               
## [13] "Lyme 1 month antibiotics"  "healthy Lyme"             
## [15] "Lyme 6 months antibiotics" "nonUL GSE120854"          
## [17] "UL GSE120854"              "healthy myo"              
## [19] "leiomyoma"

covid19g <- grep('cv19',sixStudiesAndSystemGenes$group)
fibroid1g <- grep('GSE120854',sixStudiesAndSystemGenes$group)
RAg <- grep(' RA',sixStudiesAndSystemGenes$group)
fibroid2g <- grep('myo',sixStudiesAndSystemGenes$group)
lymeg <- grep('Lyme',sixStudiesAndSystemGenes$group)
ibsg <- grep('IBS',sixStudiesAndSystemGenes$group)

sixStudiesAndSystemGenes$researchStudy <- 'researchStudy'

sixStudiesAndSystemGenes[covid19g,16] <- 'covid 19 GSE152418'
sixStudiesAndSystemGenes[fibroid1g,16] <- 'fibroid GSE120854'
sixStudiesAndSystemGenes[fibroid2g,16] <- 'fibroid GSE128242'
sixStudiesAndSystemGenes[ibsg,16] <- 'IBS GSE135223'
sixStudiesAndSystemGenes[lymeg,16] <- 'Lyme disease GSE145974'
sixStudiesAndSystemGenes[RAg,16] <- 'Rheumatoid Arthritis GSE151161'

unique(sixStudiesAndSystemGenes$researchStudy)

## [1] "IBS GSE135223"                  "covid 19 GSE152418"            
## [3] "Rheumatoid Arthritis GSE151161" "Lyme disease GSE145974"        
## [5] "fibroid GSE120854"              "fibroid GSE128242"

write.csv(sixStudiesAndSystemGenes,'sixStudiesGatheredFCsGrouped.csv',row.names=F)

I created a dashboard in Tableau that is available to explore on public.tableau.com.

image of dashboard of all 6 studies

In this dashboard there are three filters at the top left that filter by research study, then by body system or genecards.org topic for top genes, and also by gene. You don’t have to select an item from each filter, but if you do, the three charts in the dashboard and the Gene Cards gene summary will zoom in on that filtered data. To go back to all of the data, select each selected item again to remove the highlight. There are two color legends. The middle left legend is for the fold change groups within each study, which display at the bottom left. The middle right legend is for the individual sample and group gene expression values, which are shown on a log scale using Tableau’s chart options. Unfortunately, some nulls appeared in Tableau when uploading the data, which is approximately 2 GB in size. Tableau can also be finicky with character strings, but none of the three gene summaries caused a problem in Tableau today.

Janis Corona

9/9/2020