This is a quick Part 2 addition to Part 1. It is attached to the end of this Part 1 version, set off by a row of equal signs and 3 asterisks, and you can also search for “Part 2” to get to that section. In Part 2 we look only at genes that are duplicated in at least one set, keep those as the uterine fibroid genes for our pathologies database, add them to the database, upload it, and share it.

This is meant to be a quick study, and we will see what we find. I was wondering whether there is some connection between fibroid tumor growth and the genes from the studies I have been analyzing over the last few months on Epstein-Barr virus (EBV) infection, latent EBV infection, and associated pathologies. I pulled up a study that compares uterine fibroid and myometrial tissue in Black and White females, with gene expression profiles to compare. The study ID is GSE244187, and the research article, with further details, is available through the PMID link on the study's NCBI page.

library(rmarkdown)

Let’s look at what we have.

UL <- read.table("GSE244187_AlHendy_BulkTissue_Mar2021.featureCounts-genes.xls.gz", header=T)

str(UL)
## 'data.frame':    58735 obs. of  21 variables:
##  $ GeneID                                       : chr  "ENSG00000223972" "ENSG00000227232" "ENSG00000278267" "ENSG00000243485" ...
##  $ GeneSymbol                                   : chr  "DDX11L1" "WASH7P" "MIR6859-1" "MIR1302-2HG" ...
##  $ GeneBiotype                                  : chr  "transcribed_unprocessed_pseudogene" "unprocessed_pseudogene" "miRNA" "lincRNA" ...
##  $ rnamap.trim.MyoF.348_S12.geneAbundanceHisat2 : int  0 3 0 0 0 0 0 0 0 0 ...
##  $ rnamap.trim.MyoF.428_S11.geneAbundanceHisat2 : int  0 3 0 0 0 0 0 0 0 0 ...
##  $ rnamap.trim.MyoF.483_S8.geneAbundanceHisat2  : int  3 10 2 1 0 6 0 9 4 23 ...
##  $ rnamap.trim.MyoF.526_S10.geneAbundanceHisat2 : int  0 1 3 0 0 0 0 0 0 0 ...
##  $ rnamap.trim.MyoF.UI.10_S7.geneAbundanceHisat2: int  0 4 0 0 0 0 0 0 0 0 ...
##  $ rnamap.trim.MyoF.UI.13_S9.geneAbundanceHisat2: int  0 2 0 0 0 0 0 0 0 0 ...
##  $ rnamap.trim.MyoN.432_S4.geneAbundanceHisat2  : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ rnamap.trim.MyoN.514_S2.geneAbundanceHisat2  : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ rnamap.trim.MyoN.549_S5.geneAbundanceHisat2  : int  0 1 0 1 0 0 0 0 0 0 ...
##  $ rnamap.trim.MyoN.UI.20_S1.geneAbundanceHisat2: int  1 5 0 0 0 0 0 0 0 0 ...
##  $ rnamap.trim.MyoN.UI.43_S3.geneAbundanceHisat2: int  0 4 1 0 0 0 0 0 0 0 ...
##  $ rnamap.trim.MyoN.UI.8_S6.geneAbundanceHisat2 : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ rnamap.trim.UF.372_S18.geneAbundanceHisat2   : int  0 2 2 0 0 0 0 0 0 0 ...
##  $ rnamap.trim.UF.428_S17.geneAbundanceHisat2   : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ rnamap.trim.UF.483_S14.geneAbundanceHisat2   : int  0 2 0 0 0 0 0 0 0 1 ...
##  $ rnamap.trim.UF.526_S16.geneAbundanceHisat2   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ rnamap.trim.UF.UI.13_S15.geneAbundanceHisat2 : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ rnamap.trim.UF.UI.23_S13.geneAbundanceHisat2 : int  0 1 0 0 0 1 0 2 0 2 ...
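
Despite the ".xls" in the file name, read.table() parsed the file above, so it is delimited text inside a gzip archive and not an Excel workbook. Here is a minimal sketch of a more explicit read, assuming the file is tab-delimited (I have not confirmed the delimiter). The toy file and its values are made up for illustration.

```r
# Toy stand-in for the GEO counts file (hypothetical rows, not the real data).
toy <- data.frame(
  GeneID     = c("ENSG00000223972", "ENSG00000227232"),
  GeneSymbol = c("DDX11L1", "WASH7P"),
  S1         = c(0L, 3L),
  stringsAsFactors = FALSE
)

# Write the toy table as gzip-compressed, tab-delimited text.
tmp <- tempfile(fileext = ".xls.gz")
con <- gzfile(tmp, "w")
write.table(toy, con, sep = "\t", quote = FALSE, row.names = FALSE)
close(con)

# read.table() decompresses gzip files on its own when given a file name.
# Stating sep, quote, and comment.char keeps odd characters in gene
# names from being read as quotes or comments.
counts <- read.table(tmp, header = TRUE, sep = "\t",
                     quote = "", comment.char = "",
                     stringsAsFactors = FALSE)
str(counts)
```

With the real file, only the file name would change.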

The study being analyzed uses three tissue types, all from Black and White women only: uterine fibroid tissue (UF), at-risk myometrial tissue adjacent to a uterine fibroid, and normal myometrial tissue from a normal uterus (not the same uterus the UF or at-risk tissue was biopsied from).

I tried to separate the family.soft file into sample names and IDs to describe the samples, but the column names already provide that information.

colnames(UL)
##  [1] "GeneID"                                       
##  [2] "GeneSymbol"                                   
##  [3] "GeneBiotype"                                  
##  [4] "rnamap.trim.MyoF.348_S12.geneAbundanceHisat2" 
##  [5] "rnamap.trim.MyoF.428_S11.geneAbundanceHisat2" 
##  [6] "rnamap.trim.MyoF.483_S8.geneAbundanceHisat2"  
##  [7] "rnamap.trim.MyoF.526_S10.geneAbundanceHisat2" 
##  [8] "rnamap.trim.MyoF.UI.10_S7.geneAbundanceHisat2"
##  [9] "rnamap.trim.MyoF.UI.13_S9.geneAbundanceHisat2"
## [10] "rnamap.trim.MyoN.432_S4.geneAbundanceHisat2"  
## [11] "rnamap.trim.MyoN.514_S2.geneAbundanceHisat2"  
## [12] "rnamap.trim.MyoN.549_S5.geneAbundanceHisat2"  
## [13] "rnamap.trim.MyoN.UI.20_S1.geneAbundanceHisat2"
## [14] "rnamap.trim.MyoN.UI.43_S3.geneAbundanceHisat2"
## [15] "rnamap.trim.MyoN.UI.8_S6.geneAbundanceHisat2" 
## [16] "rnamap.trim.UF.372_S18.geneAbundanceHisat2"   
## [17] "rnamap.trim.UF.428_S17.geneAbundanceHisat2"   
## [18] "rnamap.trim.UF.483_S14.geneAbundanceHisat2"   
## [19] "rnamap.trim.UF.526_S16.geneAbundanceHisat2"   
## [20] "rnamap.trim.UF.UI.13_S15.geneAbundanceHisat2" 
## [21] "rnamap.trim.UF.UI.23_S13.geneAbundanceHisat2"

These columns are trimmed RNA mapping counts. My first guess was that UF is uterine fibroid, MyoN is normal myometrium, and MyoF is myometrial tissue next to a fibroid, and the site does say the abbreviations are as guessed. Since the study examines differences by race, each sample's page states the race when clicked. By the last three digits of the sample ID, the first one is white for 414, then white for 415, black for 416, white for 417, black for 418, black 419, white 420, black 421, white 422, black 423, black 424, white 425, white 426, white 427, black 428, white 429, black 430, and black 431.

Let’s rename these columns by race and also trim the other identifiers from the titles.

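# fixed=TRUE matches the text literally, so each '.' is a period and not a regex wildcard
colnames(UL) <- gsub('rnamap.trim.','',colnames(UL), fixed=TRUE)
colnames(UL) <- gsub('.geneAbundanceHisat2','', colnames(UL), fixed=TRUE)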

colnames(UL)
##  [1] "GeneID"        "GeneSymbol"    "GeneBiotype"   "MyoF.348_S12" 
##  [5] "MyoF.428_S11"  "MyoF.483_S8"   "MyoF.526_S10"  "MyoF.UI.10_S7"
##  [9] "MyoF.UI.13_S9" "MyoN.432_S4"   "MyoN.514_S2"   "MyoN.549_S5"  
## [13] "MyoN.UI.20_S1" "MyoN.UI.43_S3" "MyoN.UI.8_S6"  "UF.372_S18"   
## [17] "UF.428_S17"    "UF.483_S14"    "UF.526_S16"    "UF.UI.13_S15" 
## [21] "UF.UI.23_S13"

Counting the 18 sample columns in order, the race assignments listed above put the White samples at positions 1, 2, 4, 7, 9, 12, 13, 14, and 16 and the Black samples at positions 3, 5, 6, 8, 10, 11, 15, 17, and 18.

white <- c(1,2,4,7,9,12,13,14,16)
black <- c(3,5,6,8,10,11,15,17,18)
names <- colnames(UL)[4:21]
names[white] <- paste(names[white],"white",sep='_')
names[black] <- paste(names[black],"black",sep='_')
colnames(UL)[4:21] <- names

colnames(UL)
##  [1] "GeneID"              "GeneSymbol"          "GeneBiotype"        
##  [4] "MyoF.348_S12_white"  "MyoF.428_S11_white"  "MyoF.483_S8_black"  
##  [7] "MyoF.526_S10_white"  "MyoF.UI.10_S7_black" "MyoF.UI.13_S9_black"
## [10] "MyoN.432_S4_white"   "MyoN.514_S2_black"   "MyoN.549_S5_white"  
## [13] "MyoN.UI.20_S1_black" "MyoN.UI.43_S3_black" "MyoN.UI.8_S6_white" 
## [16] "UF.372_S18_white"    "UF.428_S17_white"    "UF.483_S14_black"   
## [19] "UF.526_S16_white"    "UF.UI.13_S15_black"  "UF.UI.23_S13_black"
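
The same labelling could also be done with a single race vector kept in sample order, which is easier to check against the list of samples than two index vectors. This is only a sketch of that alternative. The sample names are copied from the trimmed column names above, and the `race` vector is my transcription of the race list in sample order.

```r
# Trimmed sample column names, in the order they appear in the data frame.
samples <- c(
  "MyoF.348_S12", "MyoF.428_S11", "MyoF.483_S8",
  "MyoF.526_S10", "MyoF.UI.10_S7", "MyoF.UI.13_S9",
  "MyoN.432_S4", "MyoN.514_S2", "MyoN.549_S5",
  "MyoN.UI.20_S1", "MyoN.UI.43_S3", "MyoN.UI.8_S6",
  "UF.372_S18", "UF.428_S17", "UF.483_S14",
  "UF.526_S16", "UF.UI.13_S15", "UF.UI.23_S13"
)

# Race of each sample, one entry per sample, in the same order.
race <- c("white", "white", "black", "white", "black", "black",
          "white", "black", "white", "black", "black", "white",
          "white", "white", "black", "white", "black", "black")

# Guard against a miscounted list before pasting the labels on.
stopifnot(length(race) == length(samples))

labelled <- paste(samples, race, sep = "_")
labelled
```

The result matches the renamed columns printed above.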

I am using these next data frames only as placeholders to point out which samples belong to each set. They won’t be used for anything else.

normal <- UL[,c(10:15)]
UF <- UL[,c(16:21)]
UF_risk <- UL[,c(4:9)]

normal_white <- UL[,c(10,12,15)]
UF_white <- UL[,c(16,17,19)]
UF_risk_white <- UL[,c(4,5,7)]

normal_black <- UL[,c(11,13,14)]
UF_black <- UL[,c(18,20,21)]
UF_risk_black <- UL[,c(6,8,9)]

These samples are balanced: half are White and half are Black overall, and the same even split holds within each tissue type, that is, the normal myometrial tissue, the uterine fibroid tissue, and the at-risk myometrial tissue adjacent to a uterine fibroid.
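
The balance can be checked directly from the renamed columns. This is a small sketch that assumes the tissue code is the text before the first period and the race is the text after the last underscore. The `sample_sheet` name is just for illustration.

```r
# Renamed sample columns, copied from the colnames(UL) output above.
samples <- c(
  "MyoF.348_S12_white", "MyoF.428_S11_white", "MyoF.483_S8_black",
  "MyoF.526_S10_white", "MyoF.UI.10_S7_black", "MyoF.UI.13_S9_black",
  "MyoN.432_S4_white", "MyoN.514_S2_black", "MyoN.549_S5_white",
  "MyoN.UI.20_S1_black", "MyoN.UI.43_S3_black", "MyoN.UI.8_S6_white",
  "UF.372_S18_white", "UF.428_S17_white", "UF.483_S14_black",
  "UF.526_S16_white", "UF.UI.13_S15_black", "UF.UI.23_S13_black"
)

# Tissue is everything before the first period; race is everything
# after the last underscore.
sample_sheet <- data.frame(
  sample = samples,
  tissue = sub("\\..*$", "", samples),
  race   = sub("^.*_", "", samples),
  stringsAsFactors = FALSE
)

# Cross-tabulate tissue by race; a balanced design has equal counts.
design <- table(sample_sheet$tissue, sample_sheet$race)
print(design)
```

Every tissue and race combination should show 3 samples.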

We can split the gene expression values evenly by tissue type and by race, as the study did. This is high-throughput RNA sequencing of uterine fibroid tissue, the myometrial tissue adjacent to the fibroid, and normal myometrial tissue. The normal tissue did not come from the same uterus as the fibroid. It came from a perfectly healthy uterus that was removed for personal or other health reasons, like pelvic organ prolapse.

Let’s see what we find from the mean value of the samples by tissue type, then by tissue type within each race.

UL$normal_all_mean <- rowMeans(UL[,c(10:15)])
UL$UF_all_mean <- rowMeans(UL[,c(16:21)])
UL$UF_all_risk_mean <- rowMeans(UL[,c(4:9)])

UL$normal_white_mean <- rowMeans(UL[,c(10,12,15)])
UL$UF_white_mean <- rowMeans(UL[,c(16,17,19)])
UL$UF_risk_white_mean <- rowMeans(UL[,c(4,5,7)])

UL$normal_black_mean <- rowMeans(UL[,c(11,13,14)])
UL$UF_black_mean <- rowMeans(UL[,c(18,20,21)])
UL$UF_risk_black_mean <- rowMeans(UL[,c(6,8,9)])

colnames(UL)
##  [1] "GeneID"              "GeneSymbol"          "GeneBiotype"        
##  [4] "MyoF.348_S12_white"  "MyoF.428_S11_white"  "MyoF.483_S8_black"  
##  [7] "MyoF.526_S10_white"  "MyoF.UI.10_S7_black" "MyoF.UI.13_S9_black"
## [10] "MyoN.432_S4_white"   "MyoN.514_S2_black"   "MyoN.549_S5_white"  
## [13] "MyoN.UI.20_S1_black" "MyoN.UI.43_S3_black" "MyoN.UI.8_S6_white" 
## [16] "UF.372_S18_white"    "UF.428_S17_white"    "UF.483_S14_black"   
## [19] "UF.526_S16_white"    "UF.UI.13_S15_black"  "UF.UI.23_S13_black" 
## [22] "normal_all_mean"     "UF_all_mean"         "UF_all_risk_mean"   
## [25] "normal_white_mean"   "UF_white_mean"       "UF_risk_white_mean" 
## [28] "normal_black_mean"   "UF_black_mean"       "UF_risk_black_mean"
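
Since the tissue and race are now part of each column name, the group means could also be computed by matching names and not by counting column positions. Here is a sketch on toy data. The `group_mean` helper and the toy counts are hypothetical and only illustrate the idea.

```r
# Toy counts: 2 genes by 4 samples (hypothetical values).
toy <- data.frame(
  GeneSymbol   = c("geneA", "geneB"),
  MyoN.1_white = c(2, 0),
  MyoN.2_black = c(4, 0),
  UF.1_white   = c(6, 1),
  UF.2_black   = c(10, 3),
  stringsAsFactors = FALSE
)

# Hypothetical helper: row means over every column whose name matches a regex.
group_mean <- function(df, pattern) {
  cols <- grep(pattern, colnames(df), value = TRUE)
  stopifnot(length(cols) > 0)  # fail loudly if nothing matched
  rowMeans(df[, cols, drop = FALSE])
}

# The escaped period keeps "^UF\\." from matching the new "UF_..." mean columns.
toy$normal_all_mean <- group_mean(toy, "^MyoN\\.")
toy$UF_all_mean     <- group_mean(toy, "^UF\\.")
toy$UF_white_mean   <- group_mean(toy, "^UF\\..*_white$")
toy
```

This way the means stay correct even if columns are added or reordered later.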

Now we can add fold change values relative to normal tissue for UF and for the at-risk tissue (UF_risk), first across all samples and then within each race against that race's normal samples, as this study did, to see any differences.

UL$UF_normal_all_FC <- UL$UF_all_mean/UL$normal_all_mean
UL$UF_risk_normal_all_FC <- UL$UF_all_risk_mean/UL$normal_all_mean

UL$UF_normal_white_FC <- UL$UF_white_mean/UL$normal_white_mean
UL$UF_risk_white_FC <- UL$UF_risk_white_mean/UL$normal_white_mean

UL$UF_normal_black_FC <- UL$UF_black_mean/UL$normal_black_mean
UL$UF_risk_black_FC <- UL$UF_risk_black_mean/UL$normal_black_mean

colnames(UL)
##  [1] "GeneID"                "GeneSymbol"            "GeneBiotype"          
##  [4] "MyoF.348_S12_white"    "MyoF.428_S11_white"    "MyoF.483_S8_black"    
##  [7] "MyoF.526_S10_white"    "MyoF.UI.10_S7_black"   "MyoF.UI.13_S9_black"  
## [10] "MyoN.432_S4_white"     "MyoN.514_S2_black"     "MyoN.549_S5_white"    
## [13] "MyoN.UI.20_S1_black"   "MyoN.UI.43_S3_black"   "MyoN.UI.8_S6_white"   
## [16] "UF.372_S18_white"      "UF.428_S17_white"      "UF.483_S14_black"     
## [19] "UF.526_S16_white"      "UF.UI.13_S15_black"    "UF.UI.23_S13_black"   
## [22] "normal_all_mean"       "UF_all_mean"           "UF_all_risk_mean"     
## [25] "normal_white_mean"     "UF_white_mean"         "UF_risk_white_mean"   
## [28] "normal_black_mean"     "UF_black_mean"         "UF_risk_black_mean"   
## [31] "UF_normal_all_FC"      "UF_risk_normal_all_FC" "UF_normal_white_FC"   
## [34] "UF_risk_white_FC"      "UF_normal_black_FC"    "UF_risk_black_FC"
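
Dividing raw means like this gives special values whenever a mean is zero, which is why the unfiltered table has NaN and Inf entries. Here is a small worked example with made-up means. The log2 ratio with a pseudocount of 1 at the end is a common alternative that I am not using in this analysis, shown only for comparison.

```r
# Hypothetical group means for four genes.
normal_mean  <- c(4, 0, 0, 2)
fibroid_mean <- c(8, 0, 5, 0)

# Plain ratio, the same calculation as the fold change columns above.
fc <- fibroid_mean / normal_mean
fc
# gene 1: 8/4 = 2, gene 2: 0/0 = NaN, gene 3: 5/0 = Inf, gene 4: 0/2 = 0

# is.na() is TRUE for NaN, is.infinite() is TRUE for Inf, and fc > 0 drops
# the zero ratio, so only gene 1 passes all three checks.
keep <- !is.na(fc) & !is.infinite(fc) & fc > 0
which(keep)

# Alternative, not used here: adding 1 to both means keeps every ratio finite.
log2fc <- log2((fibroid_mean + 1) / (normal_mean + 1))
log2fc
```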

We have 6 different fold change values. We will order by the UF compared to normal fold change for all samples, then for White, and then for Black separately, to get the top 10 upregulated and top 10 downregulated genes in a uterine fibroid compared to normal tissue. Rows where the fold change is NaN, infinite, or zero are removed before ordering. Let’s start with all samples and write the file out to CSV once it is ordered.

UL_all_ordered <- UL[order(UL$UF_normal_all_FC, decreasing=T),]

UL_all_filtered <- UL[!(is.na(UL$UF_normal_all_FC)),]
UL_all_filtered1 <- UL_all_filtered[!(is.infinite(UL_all_filtered$UF_normal_all_FC)),]

UL_all_nozeros <- UL_all_filtered1[UL_all_filtered1$UF_normal_all_FC>0,]

UL_all_nozeros1 <- UL_all_nozeros[order(UL_all_nozeros$UF_normal_all_FC, decreasing=T),]

UL_all_top20 <- UL_all_nozeros1[c(1:10,38196:38205),]

paged_table(UL_all_top20)
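
The filter, sort, and top 10 plus bottom 10 steps could be wrapped in one function so the last row numbers do not have to be typed by hand for each table. This is a sketch on toy data. The `top_bottom` helper is hypothetical, and it uses head() and tail() where the code above uses fixed row numbers.

```r
# Hypothetical helper: drop NaN, Inf, and zero fold changes, sort from
# largest to smallest, then keep the n largest and n smallest rows.
top_bottom <- function(df, fc_col, n = 10) {
  fc <- df[[fc_col]]
  keep <- is.finite(fc) & fc > 0
  out <- df[keep, , drop = FALSE]
  out <- out[order(out[[fc_col]], decreasing = TRUE), , drop = FALSE]
  stopifnot(nrow(out) >= 2 * n)  # need enough rows for both ends
  rbind(head(out, n), tail(out, n))
}

# Toy table with made-up fold changes, including NaN, Inf, and 0.
toy <- data.frame(
  GeneSymbol = paste0("gene", 1:8),
  FC = c(2, NaN, Inf, 0, 0.5, 8, 0.1, 1),
  stringsAsFactors = FALSE
)

top_bottom(toy, "FC", n = 2)
```

On the real data the call would look like top_bottom(UL, "UF_normal_all_FC"), and the same function would work for the White and Black fold change columns.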

Let’s write out the files we have so far.

write.csv(UL, 'UL_all_FCs_58735_notFiltered_hasNaNs_hasINf.csv',row.names=F)

write.csv(UL_all_nozeros1,'UL_normal_allRaces_Foldchanges_GSE244187_filtered.csv',row.names=F)

write.csv(UL_all_top20, 'UL_normal_allRaces_Top20.csv',row.names=F)

Now let’s do the same for the White UL compared to normal fold change values, and then for the Black UL compared to normal fold change values.

UL_normal_white_FC <- UL[!(is.na(UL$UF_normal_white_FC)),]
UL_normal_white_FC1 <- UL_normal_white_FC[!(is.infinite(UL_normal_white_FC$UF_normal_white_FC)),]

UL_normal_white_FC2 <- UL_normal_white_FC1[UL_normal_white_FC1$UF_normal_white_FC >0,]

UL_normal_white_FC3 <- UL_normal_white_FC2[order(UL_normal_white_FC2$UF_normal_white_FC, decreasing=T),]

UL_normal_white_top20 <- UL_normal_white_FC3[c(1:10,27051:27060),]

paged_table(UL_normal_white_top20)
write.csv(UL_normal_white_top20, 'UL_normal_white_Top20.csv',row.names=F)

Now let’s get the Black fold change values and top 20 genes for UL compared to normal tissue.

UL_normal_black_FC <- UL[!(is.na(UL$UF_normal_black_FC)),]
UL_normal_black_FC1 <- UL_normal_black_FC[!(is.infinite(UL_normal_black_FC$UF_normal_black_FC)),]

UL_normal_black_FC2 <- UL_normal_black_FC1[UL_normal_black_FC1$UF_normal_black_FC > 0,]

UL_normal_black_FC3 <- UL_normal_black_FC2[order(UL_normal_black_FC2$UF_normal_black_FC, decreasing=T),]

UL_normal_black_top20 <- UL_normal_black_FC3[c(1:10,29867:29876),]

paged_table(UL_normal_black_top20)

Let’s write out this file of top 20 genes for uterine fibroids in Black women.

write.csv(UL_normal_black_top20,'UL_normal_black_Top20.csv', row.names=F)

Let’s combine the top 20 for all, White, and Black. We left the UL risk genes alone, but we could also look at those after we combine these genes.

From each table, keep only the respective mean and fold change columns, and add a column identifying which set the fold change values came from, i.e. all, White, or Black.

table_all <- UL_all_top20[,c(1:3,22,23,31)]
table_all$foldchange_source <- "fibroid and normal all samples"

colnames(table_all)[4:6] <- c('normal_mean','fibroid_mean','foldchange_fibroid_vs_normal')

paged_table(table_all)

table_white <- UL_normal_white_top20[,c(1:3,25,26,33)]
table_white$foldchange_source <- "fibroid and normal White samples"

colnames(table_white)[4:6] <- c('normal_mean','fibroid_mean','foldchange_fibroid_vs_normal')

paged_table(table_white)

table_black <- UL_normal_black_top20[,c(1:3,28,29,35)]
table_black$foldchange_source <- "fibroid and normal Black samples"

colnames(table_black)[4:6] <- c('normal_mean','fibroid_mean','foldchange_fibroid_vs_normal')

paged_table(table_black)

Combine these genes together.

all60 <- rbind(table_all, table_white, table_black)

paged_table(all60)
write.csv(all60, "all60_topGenes_all_White_Black.csv", row.names=F)
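
Since the three top 20 lists are stacked, a gene can show up more than once in the combined table. Here is a sketch of how the genes shared across lists could be pulled out, using toy data with the same two column names. The gene names and the shortened source labels are placeholders.

```r
# Toy version of the combined table (placeholder genes and labels).
all60_toy <- data.frame(
  GeneSymbol = c("geneA", "geneB", "geneA", "geneC", "geneB", "geneA"),
  foldchange_source = c("all", "all", "White", "White", "Black", "Black"),
  stringsAsFactors = FALSE
)

# Count how many of the lists each gene appears in.
n_sets <- table(all60_toy$GeneSymbol)

# Genes found in more than one list.
shared <- names(n_sets[n_sets > 1])
shared

# For each gene, the lists it came from, in order of appearance.
sources <- tapply(all60_toy$foldchange_source, all60_toy$GeneSymbol,
                  paste, collapse = ", ")
sources[shared]
```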

Let me share these tables with you before proceeding to a comparison with the genes that our studies have shown to be associated with the pathologies we have looked at so far.

The large 58k gene database of fold change and mean values, with all NaN and infinite values included, is here.

The large gene database with the NaN and infinite values removed, for the fibroid vs normal fold change values only, is here.

And the top 60 genes for all races, White, and Black separately are combined in the data table here.

=============================================

Now let’s compare the genes in the unfiltered fold change data with the genes in our pathologies database and see if there are any findings of importance.

You can get access to our pathologies database as it stands so far. The last data we added was the Hodgkin’s data with EBV and HIV.

Let’s read in our pathologies database.

path <- "C:...current Pathologies Database/" #copy path to the downloadable pathologies file
setwd(path)

pathologies <- read.csv("pathologies_Hodgkin_added_cHL_EBV_HIV_3-27-2026.csv", header=T, sep=',')
str(pathologies)
## 'data.frame':    289 obs. of  7 variables:
##  $ Ensembl_ID          : chr  "ENSG00000211899" "ENSG00000164458" "ENSG00000211644" "ENSG00000125869" ...
##  $ Genecards_ID        : chr  "IGHM" "TBXT" "IGLV1-51" "LAMP5" ...
##  $ FC_pathology_control: num  18550 1051 179 140 105 ...
##  $ topGenePathology    : chr  "Epstein Barr Virus" "Epstein Barr Virus" "Epstein Barr Virus" "Epstein Barr Virus" ...
##  $ mediaType           : chr  "LCLs of PBMCs RNA-Seq format" "LCLs of PBMCs RNA-Seq format" "LCLs of PBMCs RNA-Seq format" "LCLs of PBMCs RNA-Seq format" ...
##  $ studySummarized     : chr  "The EBV or Epstein-Barr Viral infected samples were obtained from lymphoblastic cells in peripheral blood monon"| __truncated__ "The EBV or Epstein-Barr Viral infected samples were obtained from lymphoblastic cells in peripheral blood monon"| __truncated__ "The EBV or Epstein-Barr Viral infected samples were obtained from lymphoblastic cells in peripheral blood monon"| __truncated__ "The EBV or Epstein-Barr Viral infected samples were obtained from lymphoblastic cells in peripheral blood monon"| __truncated__ ...
##  $ GSE_study_ID        : chr  "GSE253756" "GSE253756" "GSE253756" "GSE253756" ...
genesPathology <- pathologies$Genecards_ID

genesPathology
##   [1] "IGHM"            "TBXT"            "IGLV1-51"        "LAMP5"          
##   [5] "ICOS"            "PACSIN1"         "EFR3B"           "SIRPB1"         
##   [9] "ISX"             "MIR4537"         "LINC00540"       "DTHD1"          
##  [13] "SCT"             "DPYSL4"          "NBPF3"           "THBS1"          
##  [17] "SLC16A14"        "GIMAP7"          "SERPINB2"        "LINC00327"      
##  [21] "GNG12"           "HMGA2"           "ENSG00000255026" "CPA4"           
##  [25] "IGF1"            "AKAP6"           "RGPD2"           "EFEMP1"         
##  [29] "ADGRE4P"         "TMEM132B"        "LINC02898"       "TMPRSS3"        
##  [33] "CCL20"           "CD93"            "CACNA1B"         "MUC20P1"        
##  [37] "LOC102724560"    "TIGIT"           "KLHL13"          "ENSG00000261471"
##  [41] "CHL1-AS2"        "GOLT1A"          "GPR171"          "DZIP1"          
##  [45] "DTNA"            "TINAG"           "CLU"             "MIR503HG"       
##  [49] "RBPMS"           "PHLDB2"          "GPRIN3"          "TBX15"          
##  [53] "LINC01515"       "PALLD"           "COL27A1"         "CDCP1"          
##  [57] "USP6"            "SLC4A4"          "SORBS2"          "MDGA1"          
##  [61] "HOGA1"           "ADGRA3"          "BCO1"            "FKBP10"         
##  [65] "TRIM71"          "TEX15"           "RHOU"            "IGHG3"          
##  [69] "COL4A1"          "PXDN"            "ME3"             "PTPRG-AS1"      
##  [73] "FREM1"           "DDO"             "PRKY"            "IGHV3-30"       
##  [77] "IGHG2"           "IGHGP"           "IGHG1"           "IGHG4"          
##  [81] "UPK3BL1"         "ENSG00000268292" "ENSG00000268292" "LOC101928819"   
##  [85] "LOC101928819"    "TVP23A"          "TVP23A"          "GNAS-AS1"       
##  [89] "GNAS-AS1"        "HOXC10"          "HOXC10"          "HOXC10"         
##  [93] "CXCL2"           "CSF3"            "CH25H"           "ISG20"          
##  [97] "CLEC2L"          "PSMF1"           "RNF168"          "PEX26"          
## [101] "F2"              "KCNJ16"          "MAP2K7"          "ESYT1"          
## [105] "GATC"            "ENO1"            "CYP7B1"          "IGFALS"         
## [109] "OR52A4"          "INAFM1"          "DLG3"            "TMEM194A"       
## [113] "RGPD3"           "HPGD"            "SLC1A1"          "NUDT18"         
## [117] "LOC400657"       "OTOS"            "HECW1"           "POU4F2"         
## [121] "FRS3"            "PDZRN3"          "KHDRBS3"         "CENPF"          
## [125] "FAM162A"         "CABP1"           "POU3F2"          "CTXN3"          
## [129] "CLINT1"          "IGFBP7"          "FAM20C"          "SMARCA2"        
## [133] "HSD3B1"          "ST3GAL3"         "TSHZ2"           "NLRP3"          
## [137] "POM121L15P"      "CASC8"           "ESRP1"           "RPL7P61"        
## [141] "PLSCR5"          "CPA6"            "ERBB4"           "CACNA1E"        
## [145] "TMEM200A"        "NDUFB5P1"        "STX12"           "CRLF3P3"        
## [149] "CDH8"            " PIK3C2A"        "KCNA6-AS1"       "SLIRPP1"        
## [153] "PDE4DIPP5"       "ST8SIA4"         "HDAC4"           "SCHLAP1"        
## [157] "ATP5MC3"         "CAMKMT"          "AQP12B"          "LZIC"           
## [161] "WDR35"           "TFPI"            "ACTN4"           "TSNARE1"        
## [165] "MIR4432HG"       "RPL31P30"        "RNU6-280P"       "KLHL29"         
## [169] "DDAH1"           "MIR382"          "MIR584"          "MIR1973"        
## [173] "MIR382"          "MIR432"          "MIR432"          "CCND1"          
## [177] "CCND1"           "MIR382"          "MIR409"          "MIR664B"        
## [181] "MIR382"          "MIR432"          "MIR489"          "CCND1"          
## [185] "ANKRD22"         "CXCL10"          "IFI27"           "IL1R2"          
## [189] "NCKAP5"          "FCER1A"          "ENSG00000286797" "LOC102724019"   
## [193] "NRCAM"           "PIRAT1"          "LYZ"             "lnc-RSPH14-1"   
## [197] "SOX5"            "PPBP"            "IGKC"            "lnc-RSPH14-1"   
## [201] "HSALNG0035179"   "LINGO2"          "LOC105377276"    "ENSG00000167522"
## [205] "ENSG00000170458" "ENSG00000174059" "ENSG00000086548" "ENSG00000182578"
## [209] "ENSG00000049768" "ENSG00000231389" "ENSG00000104432" "ENSG00000213626"
## [213] "ENSG00000226979" "ENSG00000116132" "ENSG00000123892" "ENSG00000154764"
## [217] "AC139530.2"      "AC011511.4"      "AC092143.1"      "AL645465.1"     
## [221] "AL353997.3"      "AC022384.1"      "MPO"             "TRIM6-TRIM34"   
## [225] "FUT4"            "ITGA4"           "RPL23AP34"       "AC027088.3"     
## [229] "MYOM3"           "CALCR"           "MOXD1"           "SMAD5-AS1"      
## [233] "HSPE1P7"         "AC026464.2"      "RASA4B"          "AC026954.2"     
## [237] "AL353997.3"      "AICDA"           "ARID1B"          "CARD11"         
## [241] "CD4"             "COL16A1"         "CPLX2"           "CREBBP"         
## [245] "EP300"           "EZH2"            "FAM72A"          "KDM5B"          
## [249] "KMT2D"           "LERFS"           "MUC16"           "NCOR2"          
## [253] "NR4A2"           "POU2F3"          "RPS4Y1"          "SLC18A2"        
## [257] "TIGIT"           "TP53"            "TTN"             "VAMP5"          
## [261] "XBP1"            "Z82243.1"        "ZFHX3"           "ARID1B"         
## [265] "CARD11"          "CCDC8"           "CD4"             "CREBBP"         
## [269] "CTSLP3"          "EP300"           "EZH2"            "FAM72A"         
## [273] "HNRNPA1P70"      "IGKV1D-39"       "IGKV3-20"        "KDM5B"          
## [277] "KMT2D"           "LTF"             "MMP8"            "MPO"            
## [281] "MUC16"           "NCOR2"           "PES1P2"          "RPS4Y1"         
## [285] "TIGIT"           "TP53"            "TTN"             "XBP1"           
## [289] "ZFHX3"
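
One way the comparison could be set up is to match the gene symbols in the two tables with intersect() and merge(). This is a sketch on toy data and not necessarily the exact steps used later. The column names follow the two data frames above, while the genes, fold changes, and pathology labels are placeholders.

```r
# Toy fibroid table (placeholder genes and fold changes).
fibroid_toy <- data.frame(
  GeneSymbol = c("geneA", "geneX", "geneB"),
  UF_normal_all_FC = c(5.0, 1.2, 0.4),
  stringsAsFactors = FALSE
)

# Toy pathologies table; a gene can be listed under more than one pathology.
pathologies_toy <- data.frame(
  Genecards_ID = c("geneB", "geneA", "geneC", "geneA"),
  topGenePathology = c("pathology 1", "pathology 1",
                       "pathology 2", "pathology 2"),
  stringsAsFactors = FALSE
)

# Gene symbols present in both tables.
shared <- intersect(fibroid_toy$GeneSymbol, pathologies_toy$Genecards_ID)
shared

# Inner join: one row per matching pair, so repeated genes give repeated rows.
matches <- merge(fibroid_toy, pathologies_toy,
                 by.x = "GeneSymbol", by.y = "Genecards_ID")
matches
```

The merge keeps the fold change next to each pathology the gene was listed under.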

I noticed an error in the dataset: for rows 204 through 216, the Ensembl IDs are in the Genecards ID column and vice versa. We need to switch these. We could go back and fix that file, but we will do it here and upload the database again. The file already available for download won’t show the fix, so I will make a separate link for the corrected file.

genecards <- pathologies$Ensembl_ID[204:216]
ensembls <- pathologies$Genecards_ID[204:216]

pathologies$Genecards_ID[204:216] <- genecards
pathologies$Ensembl_ID[204:216] <- ensembls
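
The swapped rows could also be found by pattern, without typing the row range. Here is a sketch on a four-row toy table built from values shown in this section. It assumes a swapped row is one where the Genecards column looks like an Ensembl gene ID and the Ensembl column does not. The `is_ensg` helper is hypothetical.

```r
# Toy rows: a correct row, a swapped row, a row with an Ensembl ID in
# both columns, and a row with a missing Ensembl ID.
toy <- data.frame(
  Ensembl_ID   = c("ENSG00000211899", "ANKRD11", "ENSG00000255026", NA),
  Genecards_ID = c("IGHM", "ENSG00000167522", "ENSG00000255026",
                   "lnc-RSPH14-1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE when the value looks like an Ensembl gene ID.
# grepl() returns FALSE for NA, so missing IDs are never flagged.
is_ensg <- function(x) grepl("^ENSG[0-9]+$", x)

# Swapped rows: Ensembl-style text in the Genecards column only.
swapped <- is_ensg(toy$Genecards_ID) & !is_ensg(toy$Ensembl_ID)
which(swapped)

# Swap just those rows, holding one column in a temporary copy.
tmp <- toy$Ensembl_ID[swapped]
toy$Ensembl_ID[swapped]   <- toy$Genecards_ID[swapped]
toy$Genecards_ID[swapped] <- tmp
toy
```

Rows with an Ensembl ID in both columns are left alone, since there is no gene symbol to swap in.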

pathologies$Genecards_ID
##   [1] "IGHM"            "TBXT"            "IGLV1-51"        "LAMP5"          
##   [5] "ICOS"            "PACSIN1"         "EFR3B"           "SIRPB1"         
##   [9] "ISX"             "MIR4537"         "LINC00540"       "DTHD1"          
##  [13] "SCT"             "DPYSL4"          "NBPF3"           "THBS1"          
##  [17] "SLC16A14"        "GIMAP7"          "SERPINB2"        "LINC00327"      
##  [21] "GNG12"           "HMGA2"           "ENSG00000255026" "CPA4"           
##  [25] "IGF1"            "AKAP6"           "RGPD2"           "EFEMP1"         
##  [29] "ADGRE4P"         "TMEM132B"        "LINC02898"       "TMPRSS3"        
##  [33] "CCL20"           "CD93"            "CACNA1B"         "MUC20P1"        
##  [37] "LOC102724560"    "TIGIT"           "KLHL13"          "ENSG00000261471"
##  [41] "CHL1-AS2"        "GOLT1A"          "GPR171"          "DZIP1"          
##  [45] "DTNA"            "TINAG"           "CLU"             "MIR503HG"       
##  [49] "RBPMS"           "PHLDB2"          "GPRIN3"          "TBX15"          
##  [53] "LINC01515"       "PALLD"           "COL27A1"         "CDCP1"          
##  [57] "USP6"            "SLC4A4"          "SORBS2"          "MDGA1"          
##  [61] "HOGA1"           "ADGRA3"          "BCO1"            "FKBP10"         
##  [65] "TRIM71"          "TEX15"           "RHOU"            "IGHG3"          
##  [69] "COL4A1"          "PXDN"            "ME3"             "PTPRG-AS1"      
##  [73] "FREM1"           "DDO"             "PRKY"            "IGHV3-30"       
##  [77] "IGHG2"           "IGHGP"           "IGHG1"           "IGHG4"          
##  [81] "UPK3BL1"         "ENSG00000268292" "ENSG00000268292" "LOC101928819"   
##  [85] "LOC101928819"    "TVP23A"          "TVP23A"          "GNAS-AS1"       
##  [89] "GNAS-AS1"        "HOXC10"          "HOXC10"          "HOXC10"         
##  [93] "CXCL2"           "CSF3"            "CH25H"           "ISG20"          
##  [97] "CLEC2L"          "PSMF1"           "RNF168"          "PEX26"          
## [101] "F2"              "KCNJ16"          "MAP2K7"          "ESYT1"          
## [105] "GATC"            "ENO1"            "CYP7B1"          "IGFALS"         
## [109] "OR52A4"          "INAFM1"          "DLG3"            "TMEM194A"       
## [113] "RGPD3"           "HPGD"            "SLC1A1"          "NUDT18"         
## [117] "LOC400657"       "OTOS"            "HECW1"           "POU4F2"         
## [121] "FRS3"            "PDZRN3"          "KHDRBS3"         "CENPF"          
## [125] "FAM162A"         "CABP1"           "POU3F2"          "CTXN3"          
## [129] "CLINT1"          "IGFBP7"          "FAM20C"          "SMARCA2"        
## [133] "HSD3B1"          "ST3GAL3"         "TSHZ2"           "NLRP3"          
## [137] "POM121L15P"      "CASC8"           "ESRP1"           "RPL7P61"        
## [141] "PLSCR5"          "CPA6"            "ERBB4"           "CACNA1E"        
## [145] "TMEM200A"        "NDUFB5P1"        "STX12"           "CRLF3P3"        
## [149] "CDH8"            " PIK3C2A"        "KCNA6-AS1"       "SLIRPP1"        
## [153] "PDE4DIPP5"       "ST8SIA4"         "HDAC4"           "SCHLAP1"        
## [157] "ATP5MC3"         "CAMKMT"          "AQP12B"          "LZIC"           
## [161] "WDR35"           "TFPI"            "ACTN4"           "TSNARE1"        
## [165] "MIR4432HG"       "RPL31P30"        "RNU6-280P"       "KLHL29"         
## [169] "DDAH1"           "MIR382"          "MIR584"          "MIR1973"        
## [173] "MIR382"          "MIR432"          "MIR432"          "CCND1"          
## [177] "CCND1"           "MIR382"          "MIR409"          "MIR664B"        
## [181] "MIR382"          "MIR432"          "MIR489"          "CCND1"          
## [185] "ANKRD22"         "CXCL10"          "IFI27"           "IL1R2"          
## [189] "NCKAP5"          "FCER1A"          "ENSG00000286797" "LOC102724019"   
## [193] "NRCAM"           "PIRAT1"          "LYZ"             "lnc-RSPH14-1"   
## [197] "SOX5"            "PPBP"            "IGKC"            "lnc-RSPH14-1"   
## [201] "HSALNG0035179"   "LINGO2"          "LOC105377276"    "ANKRD11"        
## [205] "CD14"            "CD34"            "CEACAM6"         "CSF1R"          
## [209] "FOXP3"           "HLA-DPA1"        "IL7"             "LBH"            
## [213] "LTA"             "PRRX1"           "RAB38"           "WNT7A"          
## [217] "AC139530.2"      "AC011511.4"      "AC092143.1"      "AL645465.1"     
## [221] "AL353997.3"      "AC022384.1"      "MPO"             "TRIM6-TRIM34"   
## [225] "FUT4"            "ITGA4"           "RPL23AP34"       "AC027088.3"     
## [229] "MYOM3"           "CALCR"           "MOXD1"           "SMAD5-AS1"      
## [233] "HSPE1P7"         "AC026464.2"      "RASA4B"          "AC026954.2"     
## [237] "AL353997.3"      "AICDA"           "ARID1B"          "CARD11"         
## [241] "CD4"             "COL16A1"         "CPLX2"           "CREBBP"         
## [245] "EP300"           "EZH2"            "FAM72A"          "KDM5B"          
## [249] "KMT2D"           "LERFS"           "MUC16"           "NCOR2"          
## [253] "NR4A2"           "POU2F3"          "RPS4Y1"          "SLC18A2"        
## [257] "TIGIT"           "TP53"            "TTN"             "VAMP5"          
## [261] "XBP1"            "Z82243.1"        "ZFHX3"           "ARID1B"         
## [265] "CARD11"          "CCDC8"           "CD4"             "CREBBP"         
## [269] "CTSLP3"          "EP300"           "EZH2"            "FAM72A"         
## [273] "HNRNPA1P70"      "IGKV1D-39"       "IGKV3-20"        "KDM5B"          
## [277] "KMT2D"           "LTF"             "MMP8"            "MPO"            
## [281] "MUC16"           "NCOR2"           "PES1P2"          "RPS4Y1"         
## [285] "TIGIT"           "TP53"            "TTN"             "XBP1"           
## [289] "ZFHX3"
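
One more thing visible in the list above is that entry 150 has a leading space (" PIK3C2A"), which would keep it from matching the same symbol in another table. Here is a sketch of the cleanup with trimws(), using three entries copied from the output.

```r
# Three entries copied from the output above; the middle one has a leading space.
ids <- c("CDH8", " PIK3C2A", "KCNA6-AS1")

# The stray space prevents an exact match.
"PIK3C2A" %in% ids

# trimws() removes leading and trailing whitespace.
clean <- trimws(ids)
"PIK3C2A" %in% clean
```

On the database this would be pathologies$Genecards_ID <- trimws(pathologies$Genecards_ID), run before any matching.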
pathologies$Ensembl_ID
##   [1] "ENSG00000211899" "ENSG00000164458" "ENSG00000211644" "ENSG00000125869"
##   [5] "ENSG00000163600" "ENSG00000124507" "ENSG00000084710" "ENSG00000101307"
##   [9] "ENSG00000175329" "ENSG00000264781" "ENSG00000276476" "ENSG00000197057"
##  [13] "ENSG00000070031" "ENSG00000151640" "ENSG00000142794" "ENSG00000137801"
##  [17] "ENSG00000163053" "ENSG00000179144" "ENSG00000197632" "ENSG00000232977"
##  [21] "ENSG00000172380" "ENSG00000149948" "ENSG00000255026" "ENSG00000128510"
##  [25] "ENSG00000017427" "ENSG00000151320" "ENSG00000185304" "ENSG00000115380"
##  [29] "ENSG00000268758" "ENSG00000139364" "ENSG00000205086" "ENSG00000160183"
##  [33] "ENSG00000115009" "ENSG00000125810" "ENSG00000148408" "ENSG00000224769"
##  [37] "ENSG00000274276" "ENSG00000181847" "ENSG00000003096" "ENSG00000261471"
##  [41] "ENSG00000224318" "ENSG00000174567" "ENSG00000174946" "ENSG00000134874"
##  [45] "ENSG00000134769" "ENSG00000137251" "ENSG00000120885" "ENSG00000223749"
##  [49] "ENSG00000157110" "ENSG00000144824" "ENSG00000185477" "ENSG00000092607"
##  [53] "ENSG00000228065" "ENSG00000129116" "ENSG00000196739" "ENSG00000163814"
##  [57] "ENSG00000129204" "ENSG00000080493" "ENSG00000154556" "ENSG00000112139"
##  [61] "ENSG00000241935" "ENSG00000152990" "ENSG00000135697" "ENSG00000141756"
##  [65] "ENSG00000206557" "ENSG00000133863" "ENSG00000116574" "ENSG00000211897"
##  [69] "ENSG00000187498" "ENSG00000130508" "ENSG00000151376" "ENSG00000241472"
##  [73] "ENSG00000164946" "ENSG00000203797" "ENSG00000099725" "ENSG00000270550"
##  [77] "ENSG00000211893" "ENSG00000253755" "ENSG00000211896" "ENSG00000211892"
##  [81] "ENSG00000267368" "ENSG00000268292" "ENSG00000268292" "ENSG00000250978"
##  [85] "ENSG00000250978" "ENSG00000166676" "ENSG00000166676" "ENSG00000235590"
##  [89] "ENSG00000235590" "ENSG00000180818" "ENSG00000180818" "ENSG00000180818"
##  [93] "ENSG00000081041" "ENSG00000108342" "ENSG00000138135" "ENSG00000172183"
##  [97] "ENSG00000236279" "ENSG00000125818" "ENSG00000163961" "ENSG00000215193"
## [101] "ENSG00000180210" "ENSG00000153822" "ENSG00000076984" "ENSG00000139641"
## [105] "ENSG00000257218" "ENSG00000074800" "ENSG00000172817" "ENSG00000099769"
## [109] "ENSG00000205494" "ENSG00000257704" "ENSG00000082458" "ENSG00000304975"
## [113] "ENSG00000153165" "ENSG00000164120" "ENSG00000106688" "ENSG00000275074"
## [117] "lysate"          "ENSG00000178602" "ENSG00000002746" "ENSG00000151615"
## [121] "ENSG00000137218" "ENSG00000121440" "ENSG00000131773" "ENSG00000117724"
## [125] "ENSG00000114023" "ENSG00000157782" "ENSG00000184486" "ENSG00000205279"
## [129] "ENSG00000113282" "ENSG00000163453" "ENSG00000177706" "ENSG00000080503"
## [133] "ENSG00000203857" "ENSG00000126091" "ENSG00000182463" "ENSG00000162711"
## [137] "ENSG00000161103" "ENSG00000246228" "ENSG00000104413" "ENSG00000230282"
## [141] "ENSG00000231213" "ENSG00000165078" "ENSG00000178568" "ENSG00000198216"
## [145] "ENSG00000164484" "ENSG00000251025" "ENSG00000117758" "ENSG00000228225"
## [149] "ENSG00000150394" "ENSG00000011405" "ENSG00000256988" "ENSG00000227505"
## [153] "ENSG00000275064" "ENSG00000113532" "ENSG00000068024" "ENSG00000281131"
## [157] "ENSG00000154518" "ENSG00000143919" "ENSG00000185176" "ENSG00000162441"
## [161] "ENSG00000118965" "ENSG00000003436" "ENSG00000130402" "ENSG00000171045"
## [165] "ENSG00000228590" "ENSG00000230702" "ENSG00000201015" "ENSG00000119771"
## [169] "ENSG00000153904" "ENSG00000283170" "ENSG00000207714" "ENSG00000284253"
## [173] "ENSG00000283170" "ENSG00000272458" "ENSG00000272458" "ENSG00000110092"
## [177] "ENSG00000110092" "ENSG00000283170" "ENSG00000199107" "ENSG00000284450"
## [181] "ENSG00000283170" "ENSG00000272458" "ENSG00000207656" "ENSG00000110092"
## [185] "ENSG00000152766" "ENSG00000169245" "ENSG00000165949" "ENSG00000115590"
## [189] "ENSG00000176771" "ENSG00000179639" "ENSG00000286797" "ENSG00000240086"
## [193] "ENSG00000303545" "ENSG00000237803" "ENSG00000257764" NA               
## [197] "ENSG00000256473" "ENSG00000287037" "ENSG00000295771" NA               
## [201] NA                "ENSG00000302413" "ENSG00000304732" "ENSG00000167522"
## [205] "ENSG00000170458" "ENSG00000174059" "ENSG00000086548" "ENSG00000182578"
## [209] "ENSG00000049768" "ENSG00000231389" "ENSG00000104432" "ENSG00000213626"
## [213] "ENSG00000226979" "ENSG00000116132" "ENSG00000123892" "ENSG00000154764"
## [217] "ENSG00000262660" "ENSG00000267303" "ENSG00000198211" "ENSG00000240963"
## [221] "ENSG00000267441" "ENSG00000272410" "ENSG00000005381" "ENSG00000258588"
## [225] "ENSG00000196371" "ENSG00000115232" "ENSG00000225991" "ENSG00000259265"
## [229] "ENSG00000142661" "ENSG00000004948" "ENSG00000079931" "ENSG00000164621"
## [233] "ENSG00000270945" "ENSG00000260108" "ENSG00000170667" "ENSG00000261915"
## [237] "ENSG00000279999" "ENSG00000111732" "ENSG00000049618" "ENSG00000198286"
## [241] "ENSG00000010610" "ENSG00000084636" "ENSG00000145920" "ENSG00000005339"
## [245] "ENSG00000100393" "ENSG00000106462" "ENSG00000196550" "ENSG00000117139"
## [249] "ENSG00000167548" "ENSG00000234665" "ENSG00000181143" "ENSG00000196498"
## [253] "ENSG00000153234" "ENSG00000137709" "ENSG00000129824" "ENSG00000165646"
## [257] "ENSG00000181847" "ENSG00000141510" "ENSG00000155657" "ENSG00000168899"
## [261] "ENSG00000100219" "ENSG00000273243" "ENSG00000140836" "ENSG00000049618"
## [265] "ENSG00000198286" "ENSG00000169515" "ENSG00000010610" "ENSG00000005339"
## [269] "ENSG00000280913" "ENSG00000100393" "ENSG00000106462" "ENSG00000196550"
## [273] "ENSG00000236946" "ENSG00000251546" "ENSG00000239951" "ENSG00000117139"
## [277] "ENSG00000167548" "ENSG00000012223" "ENSG00000118113" "ENSG00000005381"
## [281] "ENSG00000181143" "ENSG00000196498" "ENSG00000229268" "ENSG00000129824"
## [285] "ENSG00000181847" "ENSG00000141510" "ENSG00000155657" "ENSG00000100219"
## [289] "ENSG00000140836"

This looks ok, because some of the genes have no Genecards name other than their Ensembl ID.

write.csv(pathologies, 'pathologies_edited_3-31-2026.csv', row.names=F)

You can get that file here.

The LMP1 gene isn’t in the pathologies database, but the genes that each study found relevant are. Those were the genes we switched from Ensembl IDs to Genecards IDs, because the gene symbols are the Genecards IDs.

Let’s see if KDM5B, LMP1, and the other genes from the last studies are in this study. We will add LMP1 to the list of Genecards IDs, then select those genes from the fibroid data to look at fold change values comparing uterine fibroid to normal tissue in same uterus of both White and Black females.

theList <- c(pathologies$Genecards_ID,"LMP1")

theList
##   [1] "IGHM"            "TBXT"            "IGLV1-51"        "LAMP5"          
##   [5] "ICOS"            "PACSIN1"         "EFR3B"           "SIRPB1"         
##   [9] "ISX"             "MIR4537"         "LINC00540"       "DTHD1"          
##  [13] "SCT"             "DPYSL4"          "NBPF3"           "THBS1"          
##  [17] "SLC16A14"        "GIMAP7"          "SERPINB2"        "LINC00327"      
##  [21] "GNG12"           "HMGA2"           "ENSG00000255026" "CPA4"           
##  [25] "IGF1"            "AKAP6"           "RGPD2"           "EFEMP1"         
##  [29] "ADGRE4P"         "TMEM132B"        "LINC02898"       "TMPRSS3"        
##  [33] "CCL20"           "CD93"            "CACNA1B"         "MUC20P1"        
##  [37] "LOC102724560"    "TIGIT"           "KLHL13"          "ENSG00000261471"
##  [41] "CHL1-AS2"        "GOLT1A"          "GPR171"          "DZIP1"          
##  [45] "DTNA"            "TINAG"           "CLU"             "MIR503HG"       
##  [49] "RBPMS"           "PHLDB2"          "GPRIN3"          "TBX15"          
##  [53] "LINC01515"       "PALLD"           "COL27A1"         "CDCP1"          
##  [57] "USP6"            "SLC4A4"          "SORBS2"          "MDGA1"          
##  [61] "HOGA1"           "ADGRA3"          "BCO1"            "FKBP10"         
##  [65] "TRIM71"          "TEX15"           "RHOU"            "IGHG3"          
##  [69] "COL4A1"          "PXDN"            "ME3"             "PTPRG-AS1"      
##  [73] "FREM1"           "DDO"             "PRKY"            "IGHV3-30"       
##  [77] "IGHG2"           "IGHGP"           "IGHG1"           "IGHG4"          
##  [81] "UPK3BL1"         "ENSG00000268292" "ENSG00000268292" "LOC101928819"   
##  [85] "LOC101928819"    "TVP23A"          "TVP23A"          "GNAS-AS1"       
##  [89] "GNAS-AS1"        "HOXC10"          "HOXC10"          "HOXC10"         
##  [93] "CXCL2"           "CSF3"            "CH25H"           "ISG20"          
##  [97] "CLEC2L"          "PSMF1"           "RNF168"          "PEX26"          
## [101] "F2"              "KCNJ16"          "MAP2K7"          "ESYT1"          
## [105] "GATC"            "ENO1"            "CYP7B1"          "IGFALS"         
## [109] "OR52A4"          "INAFM1"          "DLG3"            "TMEM194A"       
## [113] "RGPD3"           "HPGD"            "SLC1A1"          "NUDT18"         
## [117] "LOC400657"       "OTOS"            "HECW1"           "POU4F2"         
## [121] "FRS3"            "PDZRN3"          "KHDRBS3"         "CENPF"          
## [125] "FAM162A"         "CABP1"           "POU3F2"          "CTXN3"          
## [129] "CLINT1"          "IGFBP7"          "FAM20C"          "SMARCA2"        
## [133] "HSD3B1"          "ST3GAL3"         "TSHZ2"           "NLRP3"          
## [137] "POM121L15P"      "CASC8"           "ESRP1"           "RPL7P61"        
## [141] "PLSCR5"          "CPA6"            "ERBB4"           "CACNA1E"        
## [145] "TMEM200A"        "NDUFB5P1"        "STX12"           "CRLF3P3"        
## [149] "CDH8"            " PIK3C2A"        "KCNA6-AS1"       "SLIRPP1"        
## [153] "PDE4DIPP5"       "ST8SIA4"         "HDAC4"           "SCHLAP1"        
## [157] "ATP5MC3"         "CAMKMT"          "AQP12B"          "LZIC"           
## [161] "WDR35"           "TFPI"            "ACTN4"           "TSNARE1"        
## [165] "MIR4432HG"       "RPL31P30"        "RNU6-280P"       "KLHL29"         
## [169] "DDAH1"           "MIR382"          "MIR584"          "MIR1973"        
## [173] "MIR382"          "MIR432"          "MIR432"          "CCND1"          
## [177] "CCND1"           "MIR382"          "MIR409"          "MIR664B"        
## [181] "MIR382"          "MIR432"          "MIR489"          "CCND1"          
## [185] "ANKRD22"         "CXCL10"          "IFI27"           "IL1R2"          
## [189] "NCKAP5"          "FCER1A"          "ENSG00000286797" "LOC102724019"   
## [193] "NRCAM"           "PIRAT1"          "LYZ"             "lnc-RSPH14-1"   
## [197] "SOX5"            "PPBP"            "IGKC"            "lnc-RSPH14-1"   
## [201] "HSALNG0035179"   "LINGO2"          "LOC105377276"    "ANKRD11"        
## [205] "CD14"            "CD34"            "CEACAM6"         "CSF1R"          
## [209] "FOXP3"           "HLA-DPA1"        "IL7"             "LBH"            
## [213] "LTA"             "PRRX1"           "RAB38"           "WNT7A"          
## [217] "AC139530.2"      "AC011511.4"      "AC092143.1"      "AL645465.1"     
## [221] "AL353997.3"      "AC022384.1"      "MPO"             "TRIM6-TRIM34"   
## [225] "FUT4"            "ITGA4"           "RPL23AP34"       "AC027088.3"     
## [229] "MYOM3"           "CALCR"           "MOXD1"           "SMAD5-AS1"      
## [233] "HSPE1P7"         "AC026464.2"      "RASA4B"          "AC026954.2"     
## [237] "AL353997.3"      "AICDA"           "ARID1B"          "CARD11"         
## [241] "CD4"             "COL16A1"         "CPLX2"           "CREBBP"         
## [245] "EP300"           "EZH2"            "FAM72A"          "KDM5B"          
## [249] "KMT2D"           "LERFS"           "MUC16"           "NCOR2"          
## [253] "NR4A2"           "POU2F3"          "RPS4Y1"          "SLC18A2"        
## [257] "TIGIT"           "TP53"            "TTN"             "VAMP5"          
## [261] "XBP1"            "Z82243.1"        "ZFHX3"           "ARID1B"         
## [265] "CARD11"          "CCDC8"           "CD4"             "CREBBP"         
## [269] "CTSLP3"          "EP300"           "EZH2"            "FAM72A"         
## [273] "HNRNPA1P70"      "IGKV1D-39"       "IGKV3-20"        "KDM5B"          
## [277] "KMT2D"           "LTF"             "MMP8"            "MPO"            
## [281] "MUC16"           "NCOR2"           "PES1P2"          "RPS4Y1"         
## [285] "TIGIT"           "TP53"            "TTN"             "XBP1"           
## [289] "ZFHX3"           "LMP1"

That is the full list. LMP1 is an EBV gene, not a human gene, which is good to know. I only now remember that LMP1 wasn’t in our list, even though we worked with a study entirely based on it, because LMP1 is the viral gene. Let’s just use all the genes from our pathologies data that come from the EBV associated pathologies.

EBVa <- pathologies$Genecards_ID[grep("EBV",pathologies$topGenePathology)]
EBVa
##   [1] "ANKRD22"         "CXCL10"          "IFI27"           "IL1R2"          
##   [5] "NCKAP5"          "FCER1A"          "ENSG00000286797" "LOC102724019"   
##   [9] "NRCAM"           "PIRAT1"          "LYZ"             "lnc-RSPH14-1"   
##  [13] "SOX5"            "PPBP"            "IGKC"            "lnc-RSPH14-1"   
##  [17] "HSALNG0035179"   "LINGO2"          "LOC105377276"    "ANKRD11"        
##  [21] "CD14"            "CD34"            "CEACAM6"         "CSF1R"          
##  [25] "FOXP3"           "HLA-DPA1"        "IL7"             "LBH"            
##  [29] "LTA"             "PRRX1"           "RAB38"           "WNT7A"          
##  [33] "AC139530.2"      "AC011511.4"      "AC092143.1"      "AL645465.1"     
##  [37] "AL353997.3"      "AC022384.1"      "MPO"             "TRIM6-TRIM34"   
##  [41] "FUT4"            "ITGA4"           "RPL23AP34"       "AC027088.3"     
##  [45] "MYOM3"           "CALCR"           "MOXD1"           "SMAD5-AS1"      
##  [49] "HSPE1P7"         "AC026464.2"      "RASA4B"          "AC026954.2"     
##  [53] "AL353997.3"      "AICDA"           "ARID1B"          "CARD11"         
##  [57] "CD4"             "COL16A1"         "CPLX2"           "CREBBP"         
##  [61] "EP300"           "EZH2"            "FAM72A"          "KDM5B"          
##  [65] "KMT2D"           "LERFS"           "MUC16"           "NCOR2"          
##  [69] "NR4A2"           "POU2F3"          "RPS4Y1"          "SLC18A2"        
##  [73] "TIGIT"           "TP53"            "TTN"             "VAMP5"          
##  [77] "XBP1"            "Z82243.1"        "ZFHX3"           "ARID1B"         
##  [81] "CARD11"          "CCDC8"           "CD4"             "CREBBP"         
##  [85] "CTSLP3"          "EP300"           "EZH2"            "FAM72A"         
##  [89] "HNRNPA1P70"      "IGKV1D-39"       "IGKV3-20"        "KDM5B"          
##  [93] "KMT2D"           "LTF"             "MMP8"            "MPO"            
##  [97] "MUC16"           "NCOR2"           "PES1P2"          "RPS4Y1"         
## [101] "TIGIT"           "TP53"            "TTN"             "XBP1"           
## [105] "ZFHX3"

Ok great! So we will work with this. There are 105 entries for EBV associated genes, and some symbols appear more than once. Now let’s grab all these genes from our unfiltered and filtered fibroid data of genes with fold change values.

UL_set1 <- UL[UL$GeneSymbol %in% EBVa,]
UL_set2 <- UL_all_filtered1[UL_all_filtered1$GeneSymbol %in% EBVa,]

paged_table(UL_set1)

Of the EBV associated genes in the pathologies data, 78 are found in the unfiltered set.
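Since the EBV list has 105 entries but several symbols repeat (KDM5B, TP53, and TIGIT each appear twice), a quick check like the sketch below can show how many distinct symbols were searched and which ones found no match. The data are toy stand-ins, and `ebv_genes`, `counts`, and the fold change values are hypothetical names and numbers, not objects from this analysis. The whitespace trim is only a precaution, since one entry in the longer list above printed with a leading space.

```r
# Toy stand-ins for the EBV gene list and the expression table (hypothetical values)
ebv_genes <- c("KDM5B", "TP53", "KDM5B", " CD4", "lnc-RSPH14-1")
counts <- data.frame(GeneSymbol = c("KDM5B", "TP53", "CD4", "VEGFA"),
                     FC = c(1.08, 0.90, 0.91, 0.52))

# Trim stray whitespace and drop repeated symbols before matching
ebv_clean <- unique(trimws(ebv_genes))

# Rows of the expression table whose symbol is in the cleaned list
matched <- counts[counts$GeneSymbol %in% ebv_clean, ]

# Symbols that were searched for but are absent from the expression table
unmatched <- setdiff(ebv_clean, counts$GeneSymbol)

length(ebv_clean)   # number of distinct symbols searched
nrow(matched)       # number of rows found
unmatched           # symbols with no match
```

Symbols such as lncRNA names or clone-based IDs are the likely ones to go unmatched, because they may not exist as gene symbols in a featureCounts table.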

Let’s look at a narrower data frame of our unfiltered 78 genes in the fibroid data.

UL_set1b <- UL_set1[,c(1:3,22,23,31)]

paged_table(UL_set1b)

The KDM5B gene is 8% up regulated over all samples. Let’s see how it is in the White females and the Black females.

The White females.

UL_set1c <- UL_set1[,c(1:3,25,26,33)]

paged_table(UL_set1c)

The KDM5B gene is 273% up regulated or enhanced in the White females. This gene has been associated with EBV infection and with turning gene transcription and signaling on and off in nasopharyngeal cancer and in gastric carcinoma caused by EBV.

Let’s see how KDM5B is in Black females.

UL_set1d <- UL_set1[,c(1:3,28,29,35)]

paged_table(UL_set1d)

That is interesting, because in the Black female samples KDM5B is actually inhibited or down regulated by about 54%, operating at only 45.7% of its normal level of gene expression. In the study that looked at KDM5B, this was the region of chromosome 1 where EBV liked to attach and start interfering with normal cell division. That study also looked at a couple of metastasis genes, VEGFA and VCAM1. VEGFA is a tumor angiogenesis promoter and T cell suppressor seen in metastasis, while VCAM1 encodes vascular cell adhesion molecule 1, seen in metastasis with tumor cell invasion and cellular immune response. Let’s see how these 2 genes are in the larger data.

vegfa <- c("VCAM1","VEGFA", "KDM5B","CD4")

VEGFA_VCAM1 <- UL[UL$GeneSymbol %in% vegfa,]

paged_table(VEGFA_VCAM1)

The fold change across both races for uterine fibroid compared to normal is enhanced 82.7% for VCAM1 and 8.5% for KDM5B, but silenced 48% for VEGFA and 9% for CD4. The fold change values for VCAM1, KDM5B, VEGFA, and CD4 are all up regulated or enhanced in White females, but all are down regulated or silenced in Black females.
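The percentages quoted here come from reading a fold change as a ratio, and a small helper makes the conversion explicit. This is a minimal sketch that assumes the fold change is the pathology mean divided by the control mean; `fc_to_percent` is a hypothetical helper, and the three values are illustrative numbers chosen to mirror the percentages in the text, not values read from the table.

```r
# Convert a fold change (pathology mean / control mean) to a percent change
fc_to_percent <- function(fc) {
  (fc - 1) * 100
}

# Hypothetical fold change values chosen to mirror the percentages quoted in the text
fc <- c(KDM5B_all = 1.085, KDM5B_black = 0.457, VCAM1_all = 1.827)

# Positive values are up regulated, negative values are down regulated
pct <- round(fc_to_percent(fc), 1)
pct
```

A fold change of 1 means no change, so a value of 0.457 reads as a drop of a little over half rather than a drop of 45.7%.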

Looks like there are some differences by race for the KDM5B gene in uterine fibroids. I only analyzed the samples in this study, and only quickly looked over the conclusion of the published article. The conclusion suggests that, because of excess production of fibrinogen and extracellular matrix genes in Black females compared to White females, Black females develop uterine fibroids in a larger proportion than White females, owing to their extracellular matrix characteristics before fibroids develop.

Other genes that we saw in our last study are also inhibited in Black females. That study was on an African population with Hodgkin’s lymphoma, some with EBV and some with both EBV and HIV. The genes CD4, TIGIT, and NCOR2 are all inhibited. So are XBP1, a gene seen in HIV and EBV, and EP300, another cHL or Hodgkin’s gene. We can go back up to the UL_set1 data and get the summary statistics for these features, but we will need to transpose it into a matrix first so the genes become columns.

black <- grep('black', colnames(UL_set1))

ebv_black <- UL_set1[,c(1:3, black)]

paged_table(ebv_black)
normal_black <- ebv_black[,c(2,7:9)]
uf_black <- ebv_black[,c(2,10:12)]

normal_b_t <- t(normal_black[,2:4])
uf_b_t <- t(uf_black[,2:4])

colnames(normal_b_t) <- normal_black$GeneSymbol
colnames(uf_b_t) <- uf_black$GeneSymbol
summary(normal_b_t[,c(5,51,76,57,60,21,40)])
##      KDM5B            CD4             XBP1           NCOR2          CREBBP    
##  Min.   :289.0   Min.   :41.00   Min.   :188.0   Min.   :1005   Min.   :1056  
##  1st Qu.:357.5   1st Qu.:49.50   1st Qu.:195.5   1st Qu.:1222   1st Qu.:1266  
##  Median :426.0   Median :58.00   Median :203.0   Median :1439   Median :1477  
##  Mean   :423.7   Mean   :52.67   Mean   :257.3   Mean   :1592   Mean   :1497  
##  3rd Qu.:491.0   3rd Qu.:58.50   3rd Qu.:292.0   3rd Qu.:1885   3rd Qu.:1718  
##  Max.   :556.0   Max.   :59.00   Max.   :381.0   Max.   :2331   Max.   :1958  
##      WNT7A             IL7   
##  Min.   :0.0000   Min.   :1  
##  1st Qu.:0.0000   1st Qu.:2  
##  Median :0.0000   Median :3  
##  Mean   :0.6667   Mean   :3  
##  3rd Qu.:1.0000   3rd Qu.:4  
##  Max.   :2.0000   Max.   :5

The normal summary stats for the selected genes over these 3 Black samples show median values close to the mean values.

summary(uf_b_t[,c(5,51,76,57,60,21,40)])
##      KDM5B          CD4             XBP1           NCOR2           CREBBP     
##  Min.   :104   Min.   :11.00   Min.   : 37.0   Min.   : 91.0   Min.   :145.0  
##  1st Qu.:141   1st Qu.:11.50   1st Qu.: 52.5   1st Qu.:195.0   1st Qu.:300.0  
##  Median :178   Median :12.00   Median : 68.0   Median :299.0   Median :455.0  
##  Mean   :194   Mean   :19.33   Mean   : 79.0   Mean   :293.3   Mean   :390.3  
##  3rd Qu.:239   3rd Qu.:23.50   3rd Qu.:100.0   3rd Qu.:394.5   3rd Qu.:513.0  
##  Max.   :300   Max.   :35.00   Max.   :132.0   Max.   :490.0   Max.   :571.0  
##      WNT7A             IL7     
##  Min.   : 0.000   Min.   :1.0  
##  1st Qu.: 0.000   1st Qu.:2.5  
##  Median : 0.000   Median :4.0  
##  Mean   : 3.333   Mean   :3.0  
##  3rd Qu.: 5.000   3rd Qu.:4.0  
##  Max.   :10.000   Max.   :4.0

It seems that most of these selected genes, seen in the study on cHL with EBV and HIV along with some other EBV associated genes, were higher in the normal tissue and then dropped dramatically in the fibroid tissue. The WNT7A gene was mentioned but is associated with many tissue cancers, like breast cancer. IL7 is an immune gene related to interleukins and the inflammation response. WNT7A and IL7 are the exceptions to the drop: WNT7A has a higher mean in the fibroid tissue (3.33 compared to 0.67), and IL7 has the same mean (3) with a higher median (4 compared to 3). That was in comparing Black females with uterine fibroids and genes in an EBV associated pathology.
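The summaries above pick genes by column position, which is easy to get wrong if the row order of the data ever changes. As a hedged alternative, the sketch below selects genes by symbol after transposing. The three values per gene are the min, median, and max from the normal summary above (with 3 samples those are the 3 observations), but the sample names `s1` to `s3` and the pairing of values across genes are placeholders.

```r
# Small expression table: genes in rows, three samples in columns
# (values from the normal summary; sample names and pairing are placeholders)
expr <- data.frame(GeneSymbol = c("KDM5B", "CD4", "IL7"),
                   s1 = c(289, 41, 1),
                   s2 = c(426, 58, 3),
                   s3 = c(556, 59, 5))

# Transpose so samples are rows and genes are columns, then name the columns
expr_t <- t(expr[, c("s1", "s2", "s3")])
colnames(expr_t) <- expr$GeneSymbol

# Pick genes by symbol instead of by column position
wanted <- c("KDM5B", "IL7")
summary(expr_t[, wanted])

# Per-gene means, handy for comparing tissue groups side by side
gene_means <- colMeans(expr_t[, wanted])
gene_means
```

Selecting by name also fails loudly with a subscript error if a symbol is missing, which is usually preferable to silently summarizing the wrong gene.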

The two races show different gene expression responses in their uterine fibroids, at least for this subset of genes, when comparing Black females to White females.

That was interesting, but there could be some hidden information that connects more to the EBV genes in the at risk groups. Let’s pull the at risk of fibroid samples compared to normal for fold change values and briefly compare just these genes. Let’s make a list of the ones we want to look at.

few <- c("KDM5B", "IL7", "WNT7A", "CD4", "XBP1", "NCOR2", "CREBBP")
risk <- UL[,c(2,32,34,36)]

riskFC <- risk[risk$GeneSymbol %in% few,]

paged_table(riskFC)

For the tissue at risk of developing into a fibroid (myometrial tissue biopsied adjacent to, or as close as possible to, a fibroid), all genes in every race subset are enhanced or up regulated, except for WNT7A in the White females only. You can see that all our selected EBV associated genes from nasopharyngeal carcinoma, gastric carcinoma, and Hodgkin’s lymphoma are up regulated in tissue adjacent to a tumor in the myometrium of the uterus. The endometrium lines the inside of the myometrium and the perimetrium is outside the myometrium, but the study did not use those tissues. The study says the at risk tissue is the myometrial tissue adjacent to the fibroid, and not other tissue of the uterus.

That was interesting, because there might be a connection between EBV and epithelial tissue that reaches the myometrial tissue around fibroids in both White and Black females. The fibroids themselves differ, though: in Black females the fibroid tissue takes a different turn and these genes are down regulated, while in White females the fibroid tissue stays up regulated.

We know from the previous studies we researched that EBV lies dormant and that its viral gene LMP1 can alter gene expression at the chromatin level, and so can KDM5B. Studies show that the epithelial lining of the gastrointestinal tract and of the nasal and pharyngeal passageways can develop carcinoma when EBV becomes active after sitting in its harmless latent state in the nucleus of the cells of any host who has ever been infected. We also know that CD4 T cell anergy can occur, making the immune response lazy or ineffective, and that lymph nodes can become enlarged due to blockages of clotted B cells from EBV. There is also a connection to demyelination in multiple sclerosis that needs further study to see how EBV is connected to multiple sclerosis. One study suggested it is due to the KDM5B interacting region on human host chromatin, which disrupts normal cell signaling and the transcription and translation of proteins at the cellular level, interrupting nerve myelination when the latent EBV virus becomes activated by stress or other environmental factors. The gastrointestinal tract and the genitourinary tract both contain stratified epithelial tissue. There is a mnemonic for these tissues and for conditions like skin cancer, gastric reflux, or viral infections that change stratified epithelial cells into another type, such as squamous or cuboidal, other than their natural cell shape: ‘if its satisfied, its stratisfied.’ No current studies popped up immediately on any connection between EBV and uterine fibroids, and how uterine fibroids form is still unknown.
Many studies suggest that it comes down to differences between races in how their bodies develop fibroids, as Black females have a higher reported incidence of uterine fibroids than other races. Still, around 80% of all females have them, and many don’t know it because they don’t get annual, regular, or any check ups with their OB/GYN even if symptomatic, and because people who don’t visit the doctor for pain are mostly assumed to be asymptomatic.

Thanks for joining this research into finding an EBV associated pathology in the female genitourinary tract but specifically the uterus.

=======================================================================

*** Part 2

After reading over the study, I realized I had originally misinterpreted the series information to mean that the normal myometrium tissue came from the same uterus as the at risk myometrial tissue and the uterine fibroid. That is incorrect. I reviewed the published article after the analysis and saw that the normal tissue was the myometrium of a White or Black female who was having a hysterectomy, not due to a fibroid or other pathology, but for personal reasons such as pelvic organ prolapse. The uterine fibroid and the at risk myometrial tissue are from the same uterus, with the at risk tissue taken at least 2 cm from the nearest uterine fibroid, as these are biopsies taken during or after a hysterectomy. The goal of the study was to look at the extracellular matrix (ECM) and ECM pathways in Black females compared to White females, to see whether this helps explain why Black females tend to have more fibroids, more pain with fibroids, larger fibroids, and more cases of uterine fibroids. They found significant genetic differences between the races in genes related to cartilage, fibronectin, and the ECM, both in normal myometrial tissue of Black compared to White females and in normal compared to at risk tissue. I won’t be selecting the target genes of this particular study, as they had more to do with racial disparities than with pathology compared to normal. They would just throw off the machine learning model of pathologies, which is meant to compare and predict EBV associated pathologies against non EBV associated pathologies.

Let’s look at all 60 genes again, and see which are duplicated.

duplicated <- all60[duplicated(all60$GeneSymbol),]
duplicated <- duplicated[order(duplicated$GeneSymbol,decreasing=T),]

paged_table(duplicated)

There are 14 duplicated rows, which cover 13 distinct genes.

dups <- duplicated$GeneSymbol

allDups <- all60[all60$GeneSymbol %in% dups,]

allDups <- allDups[order(allDups$GeneSymbol, decreasing=T),]

paged_table(allDups)

Let’s keep these as our genes for uterine fibroid, since each is seen as a top gene in more than one dataset of White, Black, or both races.
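The row counts here are easy to mix up, because duplicated() flags only the second and later occurrences of a value, not the first. The toy sketch below (hypothetical rows in a made-up `top` table) shows the three different counts involved: repeated rows, distinct repeated genes, and all rows belonging to repeated genes.

```r
# Toy table of top genes pooled from several comparisons (hypothetical rows)
top <- data.frame(GeneSymbol = c("TBX15", "MMP13", "TBX15", "ACTA1", "TBX15", "MMP13"),
                  source = c("all", "all", "white", "white", "black", "black"))

# duplicated() flags only the second and later occurrences of each symbol
dup_rows <- top[duplicated(top$GeneSymbol), ]

# Distinct symbols that occur more than once
dup_genes <- unique(dup_rows$GeneSymbol)

# Every row, including first occurrences, for those symbols
all_dup_rows <- top[top$GeneSymbol %in% dup_genes, ]

nrow(dup_rows)       # repeated rows only
length(dup_genes)    # distinct repeated genes
nrow(all_dup_rows)   # all rows for repeated genes
```

In the toy table a gene seen three times contributes two rows to the first count but only one to the second, which is the same reason the duplicated row count and the unique gene count differ in the real data.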

colnames(pathologies)
## [1] "Ensembl_ID"           "Genecards_ID"         "FC_pathology_control"
## [4] "topGenePathology"     "mediaType"            "studySummarized"     
## [7] "GSE_study_ID"
colnames(allDups)
## [1] "GeneID"                       "GeneSymbol"                  
## [3] "GeneBiotype"                  "normal_mean"                 
## [5] "fibroid_mean"                 "foldchange_fibroid_vs_normal"
## [7] "foldchange_source"
dupsKept <- allDups[,c(1,2,6,7)]

colnames(dupsKept) <- c("Ensembl_ID","Genecards_ID", "FC_pathology_control","studySummarized")

paged_table(dupsKept)
write.csv(dupsKept,'common27_UF_genes.csv',row.names = F)

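The pathologies columns still missing from dupsKept are topGenePathology, mediaType, and GSE_study_ID. The studySummarized column is already there and gets extended with a study description.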

dupsKept$topGenePathology <- "uterine fibroid myometrial tissue"
dupsKept$mediaType <- 'RNA of uterine fibroid biopsy tumor, normal, adjacent to tumor tissue'
dupsKept$studySummarized <- paste(dupsKept$studySummarized,"This study used RNA of uterine tissue with the hysterectomies of Black and White females to compare the uterine fibroid gene expression data. The uterine fibroid and adjacent tissue to uterine fibroid as the 'at risk of fibroid' were from same uterus but the normal tissue was biopsied from a separate uterus of normal and healthy without any other pathology but having a hysterectomy due to personal reasons or because impacted by other health issues like pelvic organ prolapse. The study found that the Black females had more ECM and fibrinogen and fibronectin than White females and were more susceptible to developing fibroids before developing one. The samples were evenly split with half and half for 3 White normal, 3 White at risk, 3 White uterine fibroid, 3 Black normal, 3 Black at risk, and 3 Black uterine fibroid. There were a total of 18 samples", sep='...')
dupsKept$GSE_study_ID <- "GSE244187"

dupsKept1 <- dupsKept[,c(1:3,5,6,4,7)]

paged_table(dupsKept1)

Combine the two tables and save as csv.

Pathologies <- rbind(pathologies,dupsKept1)

paged_table(Pathologies[c(1:5,312:316),]) #look at first and last few observations
write.csv(Pathologies,'Pathologies_UF_added_4-2-2026.csv', row.names=F)

You can get the new pathologies database here.
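Before combining tables like this, it can help to confirm that the new rows carry the same column names as the database, since rbind() on data frames matches columns by name and stops if the names differ. The sketch below uses hypothetical one-row stand-ins `db` and `new_rows` with only three of the seven columns. The two Ensembl and Genecards pairs are taken from the lists above, and the fold change numbers are made up.

```r
# Toy stand-ins for the existing database and the new rows (fold changes are made up)
db <- data.frame(Ensembl_ID = "ENSG00000117139", Genecards_ID = "KDM5B",
                 FC_pathology_control = 1.2, stringsAsFactors = FALSE)
new_rows <- data.frame(Genecards_ID = "TBX15", Ensembl_ID = "ENSG00000092607",
                       FC_pathology_control = 0.4, stringsAsFactors = FALSE)

# Reorder the new rows by column name, then confirm the layouts agree
new_rows <- new_rows[, colnames(db)]
stopifnot(identical(colnames(db), colnames(new_rows)))

# Stack the tables and check that no rows were lost
combined <- rbind(db, new_rows)
nrow(combined) == nrow(db) + nrow(new_rows)
```

Indexing by `colnames(db)` does the same job as a positional reorder but does not depend on remembering which column number is which.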

Thanks so much. Now we can move on to other EBV associated pathologies and continue building our database, working toward a predictive model that can distinguish certain pathologies or show strong similarities between pathologies. The model will be based on fold change values of normalized gene expression from various media types, taken from research studies with more current findings on the selected pathologies.

=================================

*** Part 2 extension, testing out these 27 genes to see how well they predict the uterine fibroid classification of White or Black, and the normal uterine myometrial tissue.

Let’s use our 27 genes from the UL database.

genesTop <- unique(dupsKept1$Genecards_ID)

genesTop
##  [1] "XIRP1"  "TPSB2"  "TPSAB1" "TBX15"  "SLPI"   "RPE65"  "MYH3"   "MMP13" 
##  [9] "DCX"    "CXCL14" "CHRM2"  "ASB5"   "ACTA1"

There are 13 unique genes among those 27 top gene entries by fold change values in the uterine fibroid data.

UL13 <- UL[UL$GeneSymbol %in% genesTop,]

paged_table(UL13)
colnames(UL13)
##  [1] "GeneID"                "GeneSymbol"            "GeneBiotype"          
##  [4] "MyoF.348_S12_white"    "MyoF.428_S11_white"    "MyoF.483_S8_black"    
##  [7] "MyoF.526_S10_white"    "MyoF.UI.10_S7_black"   "MyoF.UI.13_S9_black"  
## [10] "MyoN.432_S4_white"     "MyoN.514_S2_black"     "MyoN.549_S5_white"    
## [13] "MyoN.UI.20_S1_black"   "MyoN.UI.43_S3_black"   "MyoN.UI.8_S6_white"   
## [16] "UF.372_S18_white"      "UF.428_S17_white"      "UF.483_S14_black"     
## [19] "UF.526_S16_white"      "UF.UI.13_S15_black"    "UF.UI.23_S13_black"   
## [22] "normal_all_mean"       "UF_all_mean"           "UF_all_risk_mean"     
## [25] "normal_white_mean"     "UF_white_mean"         "UF_risk_white_mean"   
## [28] "normal_black_mean"     "UF_black_mean"         "UF_risk_black_mean"   
## [31] "UF_normal_all_FC"      "UF_risk_normal_all_FC" "UF_normal_white_FC"   
## [34] "UF_risk_white_FC"      "UF_normal_black_FC"    "UF_risk_black_FC"
UL13_df <- UL13[,c(2,4:21)]

class <- c("atRisk", "atRisk","atRisk","atRisk","atRisk","atRisk",
           "normal", "normal","normal","normal","normal","normal",
           "uterine fibroid", "uterine fibroid","uterine fibroid",
           "uterine fibroid","uterine fibroid","uterine fibroid")

genesLabels <- UL13_df$GeneSymbol

paged_table(UL13_df)
UL13_mx <- data.frame(t(UL13_df[,c(2:19)]))

colnames(UL13_mx) <- genesLabels

UL13_mx$class <- class

paged_table(UL13_mx)

This is our matrix of 13 genes to use in predicting the class in a 3 class model, using randomForest with basic classification settings.

library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.

Make sure our class feature is a factor, or classification with the random forest model won’t work if it is a character type feature.

UL13_mx$class <- as.factor(UL13_mx$class)

set.seed(123)

inTrain <- sample(1:18,.8*18)

training <- UL13_mx[inTrain,]
testing <- UL13_mx[-inTrain,]

table(training$class)
## 
##          atRisk          normal uterine fibroid 
##               6               3               5

All 6 of the at risk samples landed in the training set, so none would be in our testing set, and we should change this. We will manually choose the training rows instead, holding out rows 6, 10, 14, and 16. This data is balanced, with the same number of samples per class and subclass in this study.

inTrain <- c(1,2,3,4,5,7,8,9,11,12,13,15,17,18)

training <- UL13_mx[inTrain,]
testing <- UL13_mx[-inTrain,]

table(training$class)
## 
##          atRisk          normal uterine fibroid 
##               5               5               4
table(testing$class)
## 
##          atRisk          normal uterine fibroid 
##               1               1               2

This is about an 80% training and 20% testing split, with 14 and 4 of the 18 samples.
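Picking the hold out rows by hand works, but a stratified draw does the same thing reproducibly. The following is a minimal base R sketch under one assumption that differs from the split above: it holds out exactly one sample per class, giving 15 training and 3 testing samples. The names `idx_by_class`, `test_idx`, and `train_idx` are hypothetical.

```r
set.seed(123)

# Class labels in the same order as the 18 samples
class <- rep(c("atRisk", "normal", "uterine fibroid"), each = 6)

# Group the row numbers by class
idx_by_class <- split(seq_along(class), class)

# Hold out one randomly chosen row per class (the hold out size is an assumption)
test_idx <- unlist(lapply(idx_by_class, function(i) i[sample.int(length(i), 1)]))
train_idx <- setdiff(seq_along(class), test_idx)

table(class[train_idx])
table(class[test_idx])
```

Using sample.int() on the group length avoids the surprise where sample() treats a single number as a range to draw from.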

rf <- randomForest(training[,c(1:13)], training$class, mtry=4, ntree=5000, confusion=T)
rf$confusion
##                 atRisk normal uterine fibroid class.error
## atRisk               2      3               0        0.60
## normal               0      5               0        0.00
## uterine fibroid      0      1               3        0.25

On the out-of-bag (OOB) confusion matrix from the training data, the model scored 100% accuracy on the normal class, but only 75% accuracy on detecting a uterine fibroid and only 40% accuracy on detecting a sample as at risk of turning into a uterine fibroid.

Let’s see how well it predicts on unseen data of our 20% hold out set.
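The per class percentages can be read straight off the confusion matrix, which randomForest builds from out-of-bag predictions. The sketch below re-enters the counts printed above by hand, as a plain matrix named `conf` (a hypothetical name), and computes the per class and overall accuracy.

```r
# Confusion counts copied from rf$confusion above (rows = actual, columns = predicted)
labels <- c("atRisk", "normal", "uterine fibroid")
conf <- matrix(c(2, 3, 0,
                 0, 5, 0,
                 0, 1, 3),
               nrow = 3, byrow = TRUE,
               dimnames = list(actual = labels, predicted = labels))

# Per class accuracy is the diagonal divided by the row totals (1 - class.error)
per_class <- diag(conf) / rowSums(conf)

# Overall accuracy is the sum of the diagonal over all samples
overall <- sum(diag(conf)) / sum(conf)

round(per_class, 2)
round(overall, 3)
```

Overall, 10 of the 14 training samples are classified correctly out of bag, and every error is a sample called normal when it was not.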

predicted <- predict(rf,testing)
results <- data.frame(predicted=predicted, actual=testing$class)

paged_table(results)

On our hold out test set the model predicted both uterine fibroid samples correctly, for 100% accuracy on 2/2 samples, but it incorrectly predicted the one normal sample as at risk and the one at risk sample as normal.

Let’s see how well it predicts a 2 class model of uterine fibroid and not uterine fibroid.

class2a <- c('not', 'not', 'not', 'not', 'not', 'not', 
             'not', 'not', 'not', 'not', 'not', 'not', 
             'fibroid', 'fibroid', 'fibroid', 'fibroid', 'fibroid', 'fibroid')

UL13_2classUL <- UL13_mx
UL13_2classUL$class <- class2a

paged_table(UL13_2classUL)

Now we test the 2 class model. The ‘not’ class gathers the normal samples and the at risk myometrial tissue taken at least 2 cm from a uterine fibroid, and the uterine fibroid samples stay the same.

UL13_2classUL$class <- as.factor(UL13_2classUL$class) #make sure class is a factor

set.seed(123)

inTrain <- sample(1:18,.8*18)

training <- UL13_2classUL[inTrain,]
testing <- UL13_2classUL[-inTrain,]

table(training$class)
## 
## fibroid     not 
##       5       9
table(testing$class)
## 
## fibroid     not 
##       1       3

This split kept 5 of our 6 uterine fibroid samples for training the model.

Let’s see how well it does in predicting the correct class.

rf1 <- randomForest(training[,c(1:13)], training$class, mtry=4, ntree=5000, confusion=T)

rf1$confusion
##         fibroid not class.error
## fibroid       4   1         0.2
## not           0   9         0.0

The training confusion matrix for this 2 class model shows 100% accuracy in predicting the ‘not’ a fibroid class (9/9), but only 80% accuracy in predicting the fibroid class (4/5). Let’s see how well it predicts on the testing set.

predicted1 <- predict(rf1,testing)

results <- data.frame(predicted=predicted1, actual=testing$class)

paged_table(results)

That’s great! It predicted all 3 of the ‘not’ a fibroid samples and the 1 ‘fibroid’ sample correctly, for 100% accuracy on the 4 test samples of a 2 class model of fibroid or not.

Now, since this study was based on fibroids being different altogether in White versus Black females, let’s see how well the model can predict a 2 class model of White or Black for our samples of normal, at risk of fibroid, and fibroid samples.

row.names(UL13_mx)
##  [1] "MyoF.348_S12_white"  "MyoF.428_S11_white"  "MyoF.483_S8_black"  
##  [4] "MyoF.526_S10_white"  "MyoF.UI.10_S7_black" "MyoF.UI.13_S9_black"
##  [7] "MyoN.432_S4_white"   "MyoN.514_S2_black"   "MyoN.549_S5_white"  
## [10] "MyoN.UI.20_S1_black" "MyoN.UI.43_S3_black" "MyoN.UI.8_S6_white" 
## [13] "UF.372_S18_white"    "UF.428_S17_white"    "UF.483_S14_black"   
## [16] "UF.526_S16_white"    "UF.UI.13_S15_black"  "UF.UI.23_S13_black"
class2b <- c("white","white","black","white","black","black"
             ,"white","black","white","black","black","white"
             ,"white","white","black","white","black","black")
             
UL13_class2b <- UL13_mx
UL13_class2b$class <- as.factor(class2b)

paged_table(UL13_class2b)

The samples are evenly split by race, with 3 White and 3 Black samples in each of normal, at risk, and fibroid. Let’s see how well our model predicts a 2 class model by race of patient across uterine fibroid and normal tissue. We predict race alone, based on the assertion that uterine fibroid tissue differs between White and Black females to begin with, through other genes having to do with fibronectin and extracellular matrix that didn’t make our top genes by fold change.
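Since the race is part of each row name, the labels can be parsed from the names instead of typed by hand, which also lets us check the balance per tissue type. Here is a minimal sketch in base R, using the row names printed above; `sample_ids`, `race`, and `tissue` are illustrative names.

```r
# Row names copied from the row.names(UL13_mx) output above.
sample_ids <- c("MyoF.348_S12_white", "MyoF.428_S11_white", "MyoF.483_S8_black",
                "MyoF.526_S10_white", "MyoF.UI.10_S7_black", "MyoF.UI.13_S9_black",
                "MyoN.432_S4_white", "MyoN.514_S2_black", "MyoN.549_S5_white",
                "MyoN.UI.20_S1_black", "MyoN.UI.43_S3_black", "MyoN.UI.8_S6_white",
                "UF.372_S18_white", "UF.428_S17_white", "UF.483_S14_black",
                "UF.526_S16_white", "UF.UI.13_S15_black", "UF.UI.23_S13_black")

race   <- factor(sub(".*_", "", sample_ids))   # text after the last underscore
tissue <- sub("\\..*", "", sample_ids)         # text before the first dot

table(tissue, race)   # 3 Black and 3 White in each tissue prefix
```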

set.seed(123)

inTrain <- sample(1:18,.8*18)

training <- UL13_class2b[inTrain,]
testing <- UL13_class2b[-inTrain,]

table(training$class)
## 
## black white 
##     7     7

There is an even split of samples per class, with 7 samples for each race drawn from the normal, at risk, and uterine fibroid samples.

table(testing$class)
## 
## black white 
##     2     2

And an even split of testing samples with 2 samples per class. This might not do well, since each race class holds an even mix of normal, at risk, and fibroid samples, but maybe the model can still predict race, which would point to a significant difference. The range of values across normal, at risk, and fibroid is significant by itself, and now we see what happens.

rf3 <- randomForest(training[,c(1:13)], training$class, mtry=4, ntree=5000, confusion=T)

rf3$confusion
##       black white class.error
## black     3     4   0.5714286
## white     3     4   0.4285714

Ugh! So the class error on the training model is high for both classes by race. We don’t expect the testing set to be much of an improvement, but let’s see.

predicted3 <- predict(rf3,testing)

results <- data.frame(predicted=predicted3, actual=testing$class)

paged_table(results)

But on the testing set, using a 2 class model for White uterus diseased or not versus Black uterus diseased or not, the model predicted 1 out of 2 Black samples as Black (50% accuracy) and 2 out of 2 White samples as White (100% accuracy), for a total of 3 out of 4, or 75% accuracy.

Now let’s see if the model can predict race in normal, race in at risk, and race in fibroid using these 13 genes.
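With only 4 test samples, it helps to see how wide the uncertainty around 75% is. This sketch uses the exact binomial interval from base R's `binom.test()`; treating the 4 test samples as independent trials is an assumption, and this interval is not part of the original analysis.

```r
# 3 of 4 test samples were predicted correctly (from the text above).
correct <- 3
total   <- 4

accuracy <- correct / total
ci <- binom.test(correct, total)$conf.int   # exact 95% interval

round(c(accuracy = accuracy, lower = ci[1], upper = ci[2]), 2)
```

The interval is very wide at this sample size, so the 75% is better read as a rough indication than as a measured accuracy.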

We will separate the 6 samples of normal from 6 samples of at risk and from the 6 samples of fibroid and then separate by race and use a 6 sample model.

normal <- c("white","white","black","white","black","black")

atRisk <- c("white","black","white","black","black","white")

fibroid <- c("white","white","black","white","black","black")

UL13_normal <- UL13_mx[c(1:6),]
UL13_normal$class <- as.factor(normal)

UL13_atRisk <- UL13_mx[c(7:12),]
UL13_atRisk$class <- as.factor(atRisk)

UL13_fibroid <- UL13_mx[c(13:18),]
UL13_fibroid$class <- as.factor(fibroid)

paged_table(UL13_normal)
paged_table(UL13_atRisk)
paged_table(UL13_fibroid)

Now that we have our mini data tables by type of tissue, each with classes by race, we can see how well race is predicted within each of normal, at risk, and fibroid separately.

set.seed(123)

inTrain <- sample(1:6,.8*6)

training1 <- UL13_normal[inTrain,]
testing1 <- UL13_normal[-inTrain,]

table(training1$class)
## 
## black white 
##     2     2

An even split for both training and testing, with 2 samples in each class for training…

table(testing1$class)
## 
## black white 
##     1     1

… and 1 sample each in the testing set for our normal class of myometrial tissue.
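The seed, sample, and subset steps are repeated for each tissue type, so they could be wrapped in one small function. Here is a minimal sketch in base R; `split_train_test()` and the `toy` data frame are hypothetical. Note that `.8*6` is 4.8, and the printed tables show 4 training rows, so the sketch makes that rounding explicit with `floor()`.

```r
# Hypothetical helper: random train/test split of a data frame's rows.
split_train_test <- function(df, frac = 0.8, seed = 123) {
  set.seed(seed)
  n <- nrow(df)
  in_train <- sample.int(n, floor(frac * n))   # floor(4.8) = 4 rows for n = 6
  list(training = df[in_train, , drop = FALSE],
       testing  = df[-in_train, , drop = FALSE])
}

# Toy stand-in for one 6 sample subset (a single made-up feature column).
toy <- data.frame(gene1 = c(5, 3, 8, 2, 9, 7),
                  class = factor(c("white", "white", "black",
                                   "white", "black", "black")))

parts <- split_train_test(toy)
nrow(parts$training)
nrow(parts$testing)
```

Unlike the even splits seen above, a purely random draw like this is not guaranteed to balance the classes; that depends on the seed.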

rf_normal <- randomForest(training1[,c(1:13)], training1$class, mtry=4, ntree=5000, confusion=T)

rf_normal$confusion
##       black white class.error
## black     1     1         0.5
## white     1     1         0.5

Ooooh, ugh! So the training model was also terrible at predicting race in the normal uterine tissue, getting only 1 of 2 right in each class. Let’s see how well it does on the testing set.

prediction_normal <- predict(rf_normal, testing1)

results_normal <- data.frame(predicted=prediction_normal, actual=testing1$class)

paged_table(results_normal)

For the normal myometrial tissue, when predicting race, the 1 White test sample was predicted correctly (100% accuracy), but the 1 Black test sample was predicted incorrectly (0% accuracy).

Let’s look at the at risk class and the subclass of race to determine if the at risk uterine myometrial tissue that is 2 cm or more from a uterine fibroid differs enough by race that this model will predict the correct race.

set.seed(123)

inTrain <- sample(1:6,.8*6)

training2 <- UL13_atRisk[inTrain,]
testing2 <- UL13_atRisk[-inTrain,]

table(training2$class)
## 
## black white 
##     2     2
table(testing2$class)
## 
## black white 
##     1     1

An even split in class for training and testing sets. Let’s see how well the model trains and then predicts.

rf_atRisk <- randomForest(training2[,c(1:13)], training2$class, mtry=4, ntree=5000, confusion=T)

rf_atRisk$confusion
##       black white class.error
## black     1     1         0.5
## white     0     2         0.0

The White class scored 100% accuracy on 2/2 samples, but the Black class scored 50% accuracy with 1/2 samples predicted accurately. Let’s see how well this model predicts.

prediction_atRisk <- predict(rf_atRisk,testing2)

results_atRisk <- data.frame(predicted=prediction_atRisk, actual=testing2$class)

paged_table(results_atRisk)

For the myometrial tissue at risk of developing into a uterine fibroid, the model was 100% accurate in predicting race, with the 1 White sample predicted as White and the 1 Black sample predicted as Black. This is consistent with what the study showed, that there are significant differences in gene expression between Black and White women in at risk myometrial tissue near a uterine fibroid, although 2 test samples are too few to confirm it on their own.

Let’s see how well the model predicts the race in the uterine fibroid samples.

set.seed(123)

inTrain <- sample(1:6,.8*6)

training3 <- UL13_fibroid[inTrain,]
testing3 <- UL13_fibroid[-inTrain,]

table(training3$class)
## 
## black white 
##     2     2
table(testing3$class)
## 
## black white 
##     1     1

The fibroid samples are evenly split by race, the same as the other samples were. Let’s build our model and test it.

rf_fibroid <- randomForest(training3[,c(1:13)], training3$class, mtry=4, ntree=5000, confusion=T)

rf_fibroid$confusion
##       black white class.error
## black     2     0         0.0
## white     1     1         0.5

In training, the model scored 100% accuracy on the Black samples of fibroid (2/2), but only 50% accuracy on the White samples of fibroid (1/2). Let’s see how well this model predicts on the hold out testing set.

prediction_fibroid <- predict(rf_fibroid,testing3)

results_fibroid <- data.frame(predicted=prediction_fibroid, actual=testing3$class)

paged_table(results_fibroid)

This model predicted race in the fibroid samples with 100% accuracy (2/2), just like the at risk model did, while the normal tissue model only reached 50% accuracy (1/2).

Let’s look at these predicted and actual results side by side.

predictions <- data.frame(rbind(results_normal,results_atRisk,results_fibroid))

row.names(predictions) <- c("normal1","normal2","atRisk1","atRisk2","fibroid1","fibroid2")

paged_table(predictions)

In the side by side comparison, race was predicted correctly for the at risk and uterine fibroid samples, but not for the normal uterine myometrial tissue. This suggests there are differences by race, White or Black, in myometrial tissue as it turns into a uterine fibroid and once it is a uterine fibroid.
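The combined results can be summarized as accuracy per tissue type. The `paged_table` output is not shown here, so this sketch rebuilds the six test predictions from the outcomes described in the text; `predictions_demo` is a hypothetical stand-in, and the row order within each pair is assumed.

```r
# Stand-in for `predictions`, rebuilt from the outcomes described in the text.
predictions_demo <- data.frame(
  tissue    = rep(c("normal", "atRisk", "fibroid"), each = 2),
  predicted = c("white", "white", "white", "black", "white", "black"),
  actual    = c("white", "black", "white", "black", "white", "black")
)

# Mark each prediction as correct or not, then average within tissue type.
predictions_demo$correct <- predictions_demo$predicted == predictions_demo$actual
acc_by_tissue <- tapply(predictions_demo$correct, predictions_demo$tissue, mean)

acc_by_tissue
```

Each tissue type contributes only 2 test samples, so each accuracy can only be 0%, 50%, or 100%.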

Thanks so much. This little extension was added after we selected these genes earlier by fold change and by being common to at least 2 sample types by pathology state. We have now shown their effect when used in the best classification model type, the random forest classifier.

We will be exploring and analyzing the other EBV associated pathologies and colorectal cancer gene expression data in the days and weeks to come, to add to our database of pathologies. Next will be gastric carcinoma, so stay tuned and keep checking back.