We went over some genes in Tableau, comparing the top genes of non-EBV-associated pathologies with the top genes from many earlier studies. In each study we selected the top genes by fold change value and tested their predictive accuracy, which was better than 70%, and in some cases better than 90%, at classifying samples into their respective classes in that study.

We now want to compare these genes with the ones whose fold change relationships we examined in acute infectious mononucleosis (AIM) and chronic active Epstein-Barr virus (CAEBV). We compared 12 genes that showed positive or negative correlations, in terms of how large the gene expression fold change of each pathology is relative to its baseline and whether it moves in the same or the opposite direction. Here are those genes and what we found.

Let's make a character vector of these genes to pull from the datasets.

relationalGenes <- c("ASPM","HISTIH3B","OLR1","IRG1","KIF11",
                     "ILG", "ILIA", "DTL", "FFAR2", "GPR84",
                     "CCNA2", "CCL20")

relationalGenes
##  [1] "ASPM"     "HISTIH3B" "OLR1"     "IRG1"     "KIF11"    "ILG"     
##  [7] "ILIA"     "DTL"      "FFAR2"    "GPR84"    "CCNA2"    "CCL20"

Let's load some packages.

library(rmarkdown)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.

You can retrieve these data sets here at these links:

Let's read in the mono and EBV dataset first.

pathMono <- "path to CAEBV_genes_32670_FCs.csv"
setwd(pathMono)

monoEBV <- read.csv("CAEBV_genes_32670_FCs.csv") #32670 X 24

paged_table(monoEBV[1:10,])
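
Note that setwd() expects a folder, not a file. As a minimal sketch, assuming the CSV sits in some known folder, the file can also be read by its full path without changing the working directory. The folder (a temporary directory here), the file name toy_FCs.csv, and the toy values are all hypothetical stand-ins so the example runs on its own.

```r
# Hypothetical folder and file; a temporary directory stands in for the real data location.
data_dir <- tempdir()
csv_file <- file.path(data_dir, "toy_FCs.csv")

# Write a tiny stand-in table so the example is self-contained.
toy <- data.frame(gene = c("IL6", "DTL"),
                  GSM1_AIM = c(5.1, 2.3),
                  GSM2_healthy = c(1.2, 2.0))
write.csv(toy, csv_file, row.names = FALSE)

# Read by full path; the working directory is left untouched.
toy_read <- read.csv(csv_file)
dim(toy_read)
```

Building the path with file.path() keeps each read.csv() call independent of whichever folder was set last.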

Now let's read in the fibromyalgia dataset, then the other non-EBV pathology datasets: chronic fatigue syndrome, Lyme disease, and uterine leiomyoma.

pathFM <- "path to GeneSymbols_FM_FCs_filtered.csv"
setwd(pathFM)

FM <- read.csv("GeneSymbols_FM_FCs_filtered.csv") # 20142 X 17

paged_table(FM[1:10,])

Now read in the chronic fatigue syndrome dataset.

pathCFS <- "path to CFS_data_filtered_ordered_GSE293840.csv"
setwd(pathCFS)

CFS <- read.csv("CFS_data_filtered_ordered_GSE293840.csv") # 39378 X 174

paged_table(CFS[1:10,])

Let's read in the Lyme disease dataset.

pathLyme <- "path to LymeDiseaseNormalizedFCsMeansAdded_June4th2026_ABS_min-x.csv"
setwd(pathLyme)

Lyme <- read.csv("LymeDiseaseNormalizedFCsMeansAdded_June4th2026_ABS_min-x.csv") #19526 X 95

paged_table(Lyme[1:10,])

Now we will add in the uterine leiomyoma dataset.

pathUL <- "path to UL_all_FCs_58735_notFiltered_hasNaNs_hasINf.csv"
setwd(pathUL)

UL <- read.csv("UL_all_FCs_58735_notFiltered_hasNaNs_hasINf.csv") #58735X36

paged_table(UL[1:10,])

Now we have 5 datasets to combine, keeping only the genes in our relationalGenes character vector.

=============================================================================

Part 2

6/30/2026 Tuesday 730pm

We want to understand why 12 genes had relationships with these non-EBV-associated pathologies, yet no dataset contained more than 9 of them. At the very least the mono and EBV dataset should have contained all 12, so maybe there was a typo somewhere.

relationalGenes
##  [1] "ASPM"     "HISTIH3B" "OLR1"     "IRG1"     "KIF11"    "ILG"     
##  [7] "ILIA"     "DTL"      "FFAR2"    "GPR84"    "CCNA2"    "CCL20"

The three genes not in the mono and EBV data set are HISTIH3B, ILG, and ILIA.

We scrolled through the monoEBV dataset to the genes starting with I and saw that we had mistaken a 6 for a G, so ILG should be IL6, and a 1 for an I, so ILIA should be IL1A. Likewise, HISTIH3B should be HIST1H3B. Let's replace these in our character vector of relationalGenes.

relationalGenes2 <- c("ASPM", "HIST1H3B", "OLR1", "IRG1", "KIF11",
                      "IL6", "IL1A", "DTL", "FFAR2", "GPR84",
                      "CCNA2", "CCL20")

relationalGenes2
##  [1] "ASPM"     "HIST1H3B" "OLR1"     "IRG1"     "KIF11"    "IL6"     
##  [7] "IL1A"     "DTL"      "FFAR2"    "GPR84"    "CCNA2"    "CCL20"
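
The manual check above could also be scripted. This is only a sketch on toy vectors: wanted and available are hypothetical stand-ins for the gene list and for the gene column of a dataset. setdiff() reports the symbols that are absent, and adist() from base R gives the edit distance to each available symbol, so the nearest one can be offered as a hint. The hint still has to be confirmed against the data, as we did by scrolling.

```r
# Hypothetical stand-ins: a gene list with typos and the symbols present in a dataset.
wanted    <- c("ASPM", "HISTIH3B", "ILG", "ILIA", "KIF11")
available <- c("ASPM", "HIST1H3B", "IL6", "IL1A", "KIF11")

# Symbols we asked for that the dataset does not contain.
missing_genes <- setdiff(wanted, available)
missing_genes

# Edit distance from each missing symbol (rows) to each available symbol (columns).
d <- adist(missing_genes, available)
dimnames(d) <- list(missing_genes, available)

# Closest available symbol for each missing one; a hint only, not a verified correction.
closest <- available[apply(d, 1, which.min)]
data.frame(missing = missing_genes, closest = closest)
```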

We only want the actual samples and the gene name from each of these datasets, so we will omit the mean and FC columns, as well as the Ensembl ID and any other feature that is neither a sample nor the gene name.

monoEBV_strict <- monoEBV[which(monoEBV$gene %in% relationalGenes2), c(2:19)]

paged_table(monoEBV_strict) #12 of the 12 relational genes
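
As an alternative sketch, the sample columns can be picked by a name pattern instead of by position, which keeps working if columns are added or reordered. The toy table, its ENSG-style IDs, and the pattern are hypothetical and only mimic the layout of the FM data.

```r
# Toy table with an ID column, gene names, two samples, and summary columns (hypothetical values).
toy <- data.frame(gene_id = c("ENSG_a", "ENSG_b", "ENSG_c"),
                  gene_name = c("IL6", "DTL", "TP53"),
                  Healthy1 = c(1, 2, 3),
                  myo1 = c(4, 5, 6),
                  healthy_Mean = c(1, 2, 3),
                  myo_Mean = c(4, 5, 6),
                  FC_myo_healthy = c(4, 2.5, 2))
genes <- c("IL6", "DTL")

# Sample columns are a group prefix followed by digits, so the mean and FC columns are excluded.
sample_cols <- grep("^(Healthy|myo)[0-9]+$", colnames(toy), value = TRUE)

# Keep only the requested genes, the gene name, and the sample columns.
toy_strict <- toy[toy$gene_name %in% genes, c("gene_name", sample_cols)]
colnames(toy_strict)
```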

Let's make a class vector for monoEBV.

aim <- grep("AIM",colnames(monoEBV_strict))
caebv <- grep("CAEBV",colnames(monoEBV_strict))
healthy <- grep("healthy", colnames(monoEBV_strict))

classMono <- "gene"

classMono[aim] <- "AIM"
classMono[caebv] <- "CAEBV"
classMono[healthy] <- "healthy mono caebv"
colnames(monoEBV_strict)
##  [1] "gene"               "GSM2279022_AIM"     "GSM2279023_AIM"    
##  [4] "GSM2279024_AIM"     "GSM2279025_CAEBV"   "GSM2279026_AIM"    
##  [7] "GSM2279027_CAEBV"   "GSM2279028_CAEBV"   "GSM2279029_CAEBV"  
## [10] "GSM2279030_CAEBV"   "GSM2279031_healthy" "GSM2279032_healthy"
## [13] "GSM2279033_healthy" "GSM2279034_healthy" "GSM2279035_healthy"
## [16] "GSM2279036_healthy" "GSM2279037_AIM"     "GSM2279038_AIM"
classMono
##  [1] "gene"               "AIM"                "AIM"               
##  [4] "AIM"                "CAEBV"              "AIM"               
##  [7] "CAEBV"              "CAEBV"              "CAEBV"             
## [10] "CAEBV"              "healthy mono caebv" "healthy mono caebv"
## [13] "healthy mono caebv" "healthy mono caebv" "healthy mono caebv"
## [16] "healthy mono caebv" "AIM"                "AIM"

The class labels line up with the column names, so there is no need to rearrange the samples by type.
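
The same labelling step repeats for every dataset, so it could be wrapped in a small helper. This is a sketch only: label_columns is a hypothetical function, not part of any package. It starts from a vector already as long as the column names, so the result cannot come out shorter than the data even when a trailing column matches no pattern.

```r
# Hypothetical helper: give each column name the label of the pattern it contains.
# 'patterns' is a named character vector: names are the text to search for, values are the labels.
# If a column matches several patterns, the later pattern overwrites the earlier one.
label_columns <- function(cols, patterns, default = "gene") {
  labels <- rep(default, length(cols))
  for (p in names(patterns)) {
    labels[grepl(p, cols, fixed = TRUE)] <- patterns[[p]]
  }
  labels
}

# Toy column names in the style of the monoEBV data.
cols <- c("gene", "GSM2279022_AIM", "GSM2279025_CAEBV", "GSM2279031_healthy")
patterns <- c(AIM = "AIM", CAEBV = "CAEBV", healthy = "healthy mono caebv")

classToy <- label_columns(cols, patterns)
classToy
```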

Let's now get the fibromyalgia dataset samples.

colnames(FM)
##  [1] "gene_id"        "gene_name"      "Healthy1"       "Healthy2"      
##  [5] "Healthy3"       "Healthy4"       "Healthy5"       "myo1"          
##  [9] "myo2"           "myo3"           "myo4"           "myo5"          
## [13] "myo6"           "myo7"           "healthy_Mean"   "myo_Mean"      
## [17] "FC_myo_healthy"
FM_strict <- FM[which(FM$gene_name %in% relationalGenes2),c(2:14)]

paged_table(FM_strict) #5 of the 12 relational genes
colnames(FM_strict)
##  [1] "gene_name" "Healthy1"  "Healthy2"  "Healthy3"  "Healthy4"  "Healthy5" 
##  [7] "myo1"      "myo2"      "myo3"      "myo4"      "myo5"      "myo6"     
## [13] "myo7"
classFM <- "gene"

healthyFM <- grep("Healthy", colnames(FM_strict))
fibromyalgia <- grep("myo",colnames(FM_strict))

classFM[healthyFM] <- 'healthy FM'
classFM[fibromyalgia] <- 'fibromyalgia'

classFM
##  [1] "gene"         "healthy FM"   "healthy FM"   "healthy FM"   "healthy FM"  
##  [6] "healthy FM"   "fibromyalgia" "fibromyalgia" "fibromyalgia" "fibromyalgia"
## [11] "fibromyalgia" "fibromyalgia" "fibromyalgia"

Let's now get the chronic fatigue syndrome data into the same format.

colnames(CFS)
##   [1] "gene_id"                "gene_name"              "Ensembl_transcript"    
##   [4] "control_1"              "control_2"              "control_3"             
##   [7] "case_4"                 "control_5"              "case_6"                
##  [10] "control_7"              "control_8"              "case_11"               
##  [13] "case_12"                "case_13"                "case_14"               
##  [16] "control_15"             "case_16"                "control_17"            
##  [19] "case_18"                "control_21"             "control_22"            
##  [22] "case_23"                "control_24"             "case_25"               
##  [25] "case_26"                "case_27"                "case_28"               
##  [28] "case_31"                "control_32"             "case_33"               
##  [31] "control_34"             "case_35"                "control_36"            
##  [34] "control_37"             "control_38"             "case_41"               
##  [37] "case_42"                "control_43"             "control_44"            
##  [40] "control_45"             "case_46"                "control_47"            
##  [43] "control_48"             "control_51"             "case_52"               
##  [46] "case_53"                "control_54"             "control_55"            
##  [49] "case_56"                "control_57"             "case_58"               
##  [52] "control_59"             "control_60"             "case_63"               
##  [55] "case_64"                "case_65"                "control_66"            
##  [58] "case_67"                "control_68"             "case_69"               
##  [61] "case_70"                "case_71"                "control_72"            
##  [64] "case_139"               "case_140"               "case_141"              
##  [67] "case_142"               "control_143"            "control_145"           
##  [70] "control_146"            "control_147"            "case_148"              
##  [73] "case_150"               "control_151"            "control_152"           
##  [76] "case_153"               "case_154"               "control_155"           
##  [79] "case_156"               "case_157"               "case_159"              
##  [82] "case_160"               "control_161"            "control_162"           
##  [85] "case_163"               "case_164"               "control_165"           
##  [88] "case_166"               "case_167"               "control_168"           
##  [91] "control_169"            "case_170"               "case_171"              
##  [94] "case_173"               "case_174"               "case_177"              
##  [97] "case_178"               "case_179"               "control_181"           
## [100] "case_182"               "control_183"            "control_184"           
## [103] "control_185"            "case_186"               "control_187"           
## [106] "control_188"            "control_189"            "control_190"           
## [109] "case_192"               "control_193"            "control_194"           
## [112] "control_195"            "case_196"               "case_197"              
## [115] "case_198"               "control_199"            "case_200"              
## [118] "case_201"               "case_202"               "case_204"              
## [121] "case_205"               "case_206"               "control_207"           
## [124] "control_208"            "control_209"            "case_211"              
## [127] "control_212"            "case_213"               "control_214"           
## [130] "control_215"            "case_219"               "control_220"           
## [133] "case_221"               "case_222"               "case_223"              
## [136] "case_224"               "case_225"               "case_226"              
## [139] "case_230"               "case_231"               "control_232"           
## [142] "case_233"               "case_235"               "control_236"           
## [145] "case_240"               "case_241"               "case_242"              
## [148] "control_243"            "control_244"            "case_245"              
## [151] "control_246"            "case_247"               "case_248"              
## [154] "case_251"               "control_252"            "case_253"              
## [157] "case_254"               "control_255"            "control_256"           
## [160] "control_257"            "case_258"               "case_259"              
## [163] "control_260"            "control_264"            "case_265"              
## [166] "case_266"               "control_267"            "control_268"           
## [169] "case_270"               "control_271"            "case_272"              
## [172] "healthy_mean"           "CSF_mean"               "foldchange_CSF_healthy"
CFS_strict <- CFS[which(CFS$gene_name %in% relationalGenes2),c(2,4:171)]

colnames(CFS_strict) #10 genes of the 12  relational genes
##   [1] "gene_name"   "control_1"   "control_2"   "control_3"   "case_4"     
##   [6] "control_5"   "case_6"      "control_7"   "control_8"   "case_11"    
##  [11] "case_12"     "case_13"     "case_14"     "control_15"  "case_16"    
##  [16] "control_17"  "case_18"     "control_21"  "control_22"  "case_23"    
##  [21] "control_24"  "case_25"     "case_26"     "case_27"     "case_28"    
##  [26] "case_31"     "control_32"  "case_33"     "control_34"  "case_35"    
##  [31] "control_36"  "control_37"  "control_38"  "case_41"     "case_42"    
##  [36] "control_43"  "control_44"  "control_45"  "case_46"     "control_47" 
##  [41] "control_48"  "control_51"  "case_52"     "case_53"     "control_54" 
##  [46] "control_55"  "case_56"     "control_57"  "case_58"     "control_59" 
##  [51] "control_60"  "case_63"     "case_64"     "case_65"     "control_66" 
##  [56] "case_67"     "control_68"  "case_69"     "case_70"     "case_71"    
##  [61] "control_72"  "case_139"    "case_140"    "case_141"    "case_142"   
##  [66] "control_143" "control_145" "control_146" "control_147" "case_148"   
##  [71] "case_150"    "control_151" "control_152" "case_153"    "case_154"   
##  [76] "control_155" "case_156"    "case_157"    "case_159"    "case_160"   
##  [81] "control_161" "control_162" "case_163"    "case_164"    "control_165"
##  [86] "case_166"    "case_167"    "control_168" "control_169" "case_170"   
##  [91] "case_171"    "case_173"    "case_174"    "case_177"    "case_178"   
##  [96] "case_179"    "control_181" "case_182"    "control_183" "control_184"
## [101] "control_185" "case_186"    "control_187" "control_188" "control_189"
## [106] "control_190" "case_192"    "control_193" "control_194" "control_195"
## [111] "case_196"    "case_197"    "case_198"    "control_199" "case_200"   
## [116] "case_201"    "case_202"    "case_204"    "case_205"    "case_206"   
## [121] "control_207" "control_208" "control_209" "case_211"    "control_212"
## [126] "case_213"    "control_214" "control_215" "case_219"    "control_220"
## [131] "case_221"    "case_222"    "case_223"    "case_224"    "case_225"   
## [136] "case_226"    "case_230"    "case_231"    "control_232" "case_233"   
## [141] "case_235"    "control_236" "case_240"    "case_241"    "case_242"   
## [146] "control_243" "control_244" "case_245"    "control_246" "case_247"   
## [151] "case_248"    "case_251"    "control_252" "case_253"    "case_254"   
## [156] "control_255" "control_256" "control_257" "case_258"    "case_259"   
## [161] "control_260" "control_264" "case_265"    "case_266"    "control_267"
## [166] "control_268" "case_270"    "control_271" "case_272"
classCFS <- "gene"

cfs <- grep('case',colnames(CFS_strict))
healthyCFS <- grep('control',colnames(CFS_strict))

classCFS[cfs] <- "Chronic Fatigue Syndrome"
classCFS[healthyCFS] <- "healthy CFS"

classCFS
##   [1] "gene"                     "healthy CFS"             
##   [3] "healthy CFS"              "healthy CFS"             
##   [5] "Chronic Fatigue Syndrome" "healthy CFS"             
##   [7] "Chronic Fatigue Syndrome" "healthy CFS"             
##   [9] "healthy CFS"              "Chronic Fatigue Syndrome"
##  [11] "Chronic Fatigue Syndrome" "Chronic Fatigue Syndrome"
##  [13] "Chronic Fatigue Syndrome" "healthy CFS"             
##  [15] "Chronic Fatigue Syndrome" "healthy CFS"             
##  [17] "Chronic Fatigue Syndrome" "healthy CFS"             
##  [19] "healthy CFS"              "Chronic Fatigue Syndrome"
##  [21] "healthy CFS"              "Chronic Fatigue Syndrome"
##  [23] "Chronic Fatigue Syndrome" "Chronic Fatigue Syndrome"
##  [25] "Chronic Fatigue Syndrome" "Chronic Fatigue Syndrome"
##  [27] "healthy CFS"              "Chronic Fatigue Syndrome"
##  [29] "healthy CFS"              "Chronic Fatigue Syndrome"
##  [31] "healthy CFS"              "healthy CFS"             
##  [33] "healthy CFS"              "Chronic Fatigue Syndrome"
##  [35] "Chronic Fatigue Syndrome" "healthy CFS"             
##  [37] "healthy CFS"              "healthy CFS"             
##  [39] "Chronic Fatigue Syndrome" "healthy CFS"             
##  [41] "healthy CFS"              "healthy CFS"             
##  [43] "Chronic Fatigue Syndrome" "Chronic Fatigue Syndrome"
##  [45] "healthy CFS"              "healthy CFS"             
##  [47] "Chronic Fatigue Syndrome" "healthy CFS"             
##  [49] "Chronic Fatigue Syndrome" "healthy CFS"             
##  [51] "healthy CFS"              "Chronic Fatigue Syndrome"
##  [53] "Chronic Fatigue Syndrome" "Chronic Fatigue Syndrome"
##  [55] "healthy CFS"              "Chronic Fatigue Syndrome"
##  [57] "healthy CFS"              "Chronic Fatigue Syndrome"
##  [59] "Chronic Fatigue Syndrome" "Chronic Fatigue Syndrome"
##  [61] "healthy CFS"              "Chronic Fatigue Syndrome"
##  [63] "Chronic Fatigue Syndrome" "Chronic Fatigue Syndrome"
##  [65] "Chronic Fatigue Syndrome" "healthy CFS"             
##  [67] "healthy CFS"              "healthy CFS"             
##  [69] "healthy CFS"              "Chronic Fatigue Syndrome"
##  [71] "Chronic Fatigue Syndrome" "healthy CFS"             
##  [73] "healthy CFS"              "Chronic Fatigue Syndrome"
##  [75] "Chronic Fatigue Syndrome" "healthy CFS"             
##  [77] "Chronic Fatigue Syndrome" "Chronic Fatigue Syndrome"
##  [79] "Chronic Fatigue Syndrome" "Chronic Fatigue Syndrome"
##  [81] "healthy CFS"              "healthy CFS"             
##  [83] "Chronic Fatigue Syndrome" "Chronic Fatigue Syndrome"
##  [85] "healthy CFS"              "Chronic Fatigue Syndrome"
##  [87] "Chronic Fatigue Syndrome" "healthy CFS"             
##  [89] "healthy CFS"              "Chronic Fatigue Syndrome"
##  [91] "Chronic Fatigue Syndrome" "Chronic Fatigue Syndrome"
##  [93] "Chronic Fatigue Syndrome" "Chronic Fatigue Syndrome"
##  [95] "Chronic Fatigue Syndrome" "Chronic Fatigue Syndrome"
##  [97] "healthy CFS"              "Chronic Fatigue Syndrome"
##  [99] "healthy CFS"              "healthy CFS"             
## [101] "healthy CFS"              "Chronic Fatigue Syndrome"
## [103] "healthy CFS"              "healthy CFS"             
## [105] "healthy CFS"              "healthy CFS"             
## [107] "Chronic Fatigue Syndrome" "healthy CFS"             
## [109] "healthy CFS"              "healthy CFS"             
## [111] "Chronic Fatigue Syndrome" "Chronic Fatigue Syndrome"
## [113] "Chronic Fatigue Syndrome" "healthy CFS"             
## [115] "Chronic Fatigue Syndrome" "Chronic Fatigue Syndrome"
## [117] "Chronic Fatigue Syndrome" "Chronic Fatigue Syndrome"
## [119] "Chronic Fatigue Syndrome" "Chronic Fatigue Syndrome"
## [121] "healthy CFS"              "healthy CFS"             
## [123] "healthy CFS"              "Chronic Fatigue Syndrome"
## [125] "healthy CFS"              "Chronic Fatigue Syndrome"
## [127] "healthy CFS"              "healthy CFS"             
## [129] "Chronic Fatigue Syndrome" "healthy CFS"             
## [131] "Chronic Fatigue Syndrome" "Chronic Fatigue Syndrome"
## [133] "Chronic Fatigue Syndrome" "Chronic Fatigue Syndrome"
## [135] "Chronic Fatigue Syndrome" "Chronic Fatigue Syndrome"
## [137] "Chronic Fatigue Syndrome" "Chronic Fatigue Syndrome"
## [139] "healthy CFS"              "Chronic Fatigue Syndrome"
## [141] "Chronic Fatigue Syndrome" "healthy CFS"             
## [143] "Chronic Fatigue Syndrome" "Chronic Fatigue Syndrome"
## [145] "Chronic Fatigue Syndrome" "healthy CFS"             
## [147] "healthy CFS"              "Chronic Fatigue Syndrome"
## [149] "healthy CFS"              "Chronic Fatigue Syndrome"
## [151] "Chronic Fatigue Syndrome" "Chronic Fatigue Syndrome"
## [153] "healthy CFS"              "Chronic Fatigue Syndrome"
## [155] "Chronic Fatigue Syndrome" "healthy CFS"             
## [157] "healthy CFS"              "healthy CFS"             
## [159] "Chronic Fatigue Syndrome" "Chronic Fatigue Syndrome"
## [161] "healthy CFS"              "healthy CFS"             
## [163] "Chronic Fatigue Syndrome" "Chronic Fatigue Syndrome"
## [165] "healthy CFS"              "healthy CFS"             
## [167] "Chronic Fatigue Syndrome" "healthy CFS"             
## [169] "Chronic Fatigue Syndrome"

Let's apply the same process to the UL dataset.

colnames(UL)
##  [1] "GeneID"                "GeneSymbol"            "GeneBiotype"          
##  [4] "MyoF.348_S12_white"    "MyoF.428_S11_white"    "MyoF.483_S8_black"    
##  [7] "MyoF.526_S10_white"    "MyoF.UI.10_S7_black"   "MyoF.UI.13_S9_black"  
## [10] "MyoN.432_S4_white"     "MyoN.514_S2_black"     "MyoN.549_S5_white"    
## [13] "MyoN.UI.20_S1_black"   "MyoN.UI.43_S3_black"   "MyoN.UI.8_S6_white"   
## [16] "UF.372_S18_white"      "UF.428_S17_white"      "UF.483_S14_black"     
## [19] "UF.526_S16_white"      "UF.UI.13_S15_black"    "UF.UI.23_S13_black"   
## [22] "normal_all_mean"       "UF_all_mean"           "UF_all_risk_mean"     
## [25] "normal_white_mean"     "UF_white_mean"         "UF_risk_white_mean"   
## [28] "normal_black_mean"     "UF_black_mean"         "UF_risk_black_mean"   
## [31] "UF_normal_all_FC"      "UF_risk_normal_all_FC" "UF_normal_white_FC"   
## [34] "UF_risk_white_FC"      "UF_normal_black_FC"    "UF_risk_black_FC"
UL_strict <- UL[which(UL$GeneSymbol %in% relationalGenes2),c(2,4:21)]
colnames(UL_strict)
##  [1] "GeneSymbol"          "MyoF.348_S12_white"  "MyoF.428_S11_white" 
##  [4] "MyoF.483_S8_black"   "MyoF.526_S10_white"  "MyoF.UI.10_S7_black"
##  [7] "MyoF.UI.13_S9_black" "MyoN.432_S4_white"   "MyoN.514_S2_black"  
## [10] "MyoN.549_S5_white"   "MyoN.UI.20_S1_black" "MyoN.UI.43_S3_black"
## [13] "MyoN.UI.8_S6_white"  "UF.372_S18_white"    "UF.428_S17_white"   
## [16] "UF.483_S14_black"    "UF.526_S16_white"    "UF.UI.13_S15_black" 
## [19] "UF.UI.23_S13_black"

Note that in this study MyoF is at-risk tissue next to the uterine fibroid, MyoN is normal myometrial tissue from a different person, and UF is the uterine fibroid.

classUL <- "gene"

healthyUL <- grep("MyoN",colnames(UL_strict))
ul <- grep("UF", colnames(UL_strict))
ulRisk <- grep("MyoF", colnames(UL_strict))

classUL[healthyUL] <- 'healthy uterine tissue'
classUL[ul] <- 'uterine leiomyoma'
classUL[ulRisk] <- 'UL surrounding tissue'

classUL
##  [1] "gene"                   "UL surrounding tissue"  "UL surrounding tissue" 
##  [4] "UL surrounding tissue"  "UL surrounding tissue"  "UL surrounding tissue" 
##  [7] "UL surrounding tissue"  "healthy uterine tissue" "healthy uterine tissue"
## [10] "healthy uterine tissue" "healthy uterine tissue" "healthy uterine tissue"
## [13] "healthy uterine tissue" "uterine leiomyoma"      "uterine leiomyoma"     
## [16] "uterine leiomyoma"      "uterine leiomyoma"      "uterine leiomyoma"     
## [19] "uterine leiomyoma"

Now let's arrange the Lyme disease data like the others.

colnames(Lyme)
##  [1] "Gene"                      "healthyControl_1"         
##  [3] "healthyControl_2"          "healthyControl_3"         
##  [5] "healthyControl_4"          "healthyControl_5"         
##  [7] "healthyControl_6"          "healthyControl_7"         
##  [9] "healthyControl_8"          "healthyControl_9"         
## [11] "healthyControl_10"         "healthyControl_11"        
## [13] "healthyControl_12"         "healthyControl_13"        
## [15] "healthyControl_14"         "healthyControl_15"        
## [17] "healthyControl_16"         "healthyControl_17"        
## [19] "healthyControl_18"         "healthyControl_19"        
## [21] "healthyControl_20"         "healthyControl_21"        
## [23] "acuteLymeDisease_1"        "acuteLymeDisease_2"       
## [25] "acuteLymeDisease_3"        "acuteLymeDisease_4"       
## [27] "acuteLymeDisease_5"        "acuteLymeDisease_6"       
## [29] "acuteLymeDisease_7"        "acuteLymeDisease_8"       
## [31] "acuteLymeDisease_9"        "acuteLymeDisease_10"      
## [33] "acuteLymeDisease_11"       "acuteLymeDisease_12"      
## [35] "acuteLymeDisease_13"       "acuteLymeDisease_14"      
## [37] "acuteLymeDisease_15"       "acuteLymeDisease_16"      
## [39] "acuteLymeDisease_17"       "acuteLymeDisease_18"      
## [41] "acuteLymeDisease_19"       "acuteLymeDisease_20"      
## [43] "acuteLymeDisease_21"       "acuteLymeDisease_22"      
## [45] "acuteLymeDisease_23"       "acuteLymeDisease_24"      
## [47] "acuteLymeDisease_25"       "acuteLymeDisease_26"      
## [49] "acuteLymeDisease_27"       "acuteLymeDisease_28"      
## [51] "Antibodies_1month_1"       "Antibodies_1month_2"      
## [53] "Antibodies_1month_3"       "Antibodies_1month_4"      
## [55] "Antibodies_1month_5"       "Antibodies_1month_6"      
## [57] "Antibodies_1month_7"       "Antibodies_1month_8"      
## [59] "Antibodies_1month_9"       "Antibodies_1month_10"     
## [61] "Antibodies_1month_11"      "Antibodies_1month_12"     
## [63] "Antibodies_1month_13"      "Antibodies_1month_14"     
## [65] "Antibodies_1month_15"      "Antibodies_1month_16"     
## [67] "Antibodies_1month_17"      "Antibodies_1month_18"     
## [69] "Antibodies_1month_19"      "Antibodies_1month_20"     
## [71] "Antibodies_1month_21"      "Antibodies_1month_22"     
## [73] "Antibodies_1month_23"      "Antibodies_1month_24"     
## [75] "Antibodies_1month_25"      "Antibodies_1month_26"     
## [77] "Antibodies_1month_27"      "Antibodies_6months_1"     
## [79] "Antibodies_6months_2"      "Antibodies_6months_3"     
## [81] "Antibodies_6months_4"      "Antibodies_6months_5"     
## [83] "Antibodies_6months_6"      "Antibodies_6months_7"     
## [85] "Antibodies_6months_8"      "Antibodies_6months_9"     
## [87] "Antibodies_6months_10"     "healthy_mean"             
## [89] "acute_mean"                "month1_mean"              
## [91] "month6_mean"               "foldchange_acute_healthy" 
## [93] "foldchange_1month_healthy" "foldchange_6month_healthy"
## [95] "foldchange_6month_acute"
Lyme_strict <- Lyme[which(Lyme$Gene %in% relationalGenes2),c(1:87)]

colnames(Lyme_strict) #11 genes of the 12
##  [1] "Gene"                  "healthyControl_1"      "healthyControl_2"     
##  [4] "healthyControl_3"      "healthyControl_4"      "healthyControl_5"     
##  [7] "healthyControl_6"      "healthyControl_7"      "healthyControl_8"     
## [10] "healthyControl_9"      "healthyControl_10"     "healthyControl_11"    
## [13] "healthyControl_12"     "healthyControl_13"     "healthyControl_14"    
## [16] "healthyControl_15"     "healthyControl_16"     "healthyControl_17"    
## [19] "healthyControl_18"     "healthyControl_19"     "healthyControl_20"    
## [22] "healthyControl_21"     "acuteLymeDisease_1"    "acuteLymeDisease_2"   
## [25] "acuteLymeDisease_3"    "acuteLymeDisease_4"    "acuteLymeDisease_5"   
## [28] "acuteLymeDisease_6"    "acuteLymeDisease_7"    "acuteLymeDisease_8"   
## [31] "acuteLymeDisease_9"    "acuteLymeDisease_10"   "acuteLymeDisease_11"  
## [34] "acuteLymeDisease_12"   "acuteLymeDisease_13"   "acuteLymeDisease_14"  
## [37] "acuteLymeDisease_15"   "acuteLymeDisease_16"   "acuteLymeDisease_17"  
## [40] "acuteLymeDisease_18"   "acuteLymeDisease_19"   "acuteLymeDisease_20"  
## [43] "acuteLymeDisease_21"   "acuteLymeDisease_22"   "acuteLymeDisease_23"  
## [46] "acuteLymeDisease_24"   "acuteLymeDisease_25"   "acuteLymeDisease_26"  
## [49] "acuteLymeDisease_27"   "acuteLymeDisease_28"   "Antibodies_1month_1"  
## [52] "Antibodies_1month_2"   "Antibodies_1month_3"   "Antibodies_1month_4"  
## [55] "Antibodies_1month_5"   "Antibodies_1month_6"   "Antibodies_1month_7"  
## [58] "Antibodies_1month_8"   "Antibodies_1month_9"   "Antibodies_1month_10" 
## [61] "Antibodies_1month_11"  "Antibodies_1month_12"  "Antibodies_1month_13" 
## [64] "Antibodies_1month_14"  "Antibodies_1month_15"  "Antibodies_1month_16" 
## [67] "Antibodies_1month_17"  "Antibodies_1month_18"  "Antibodies_1month_19" 
## [70] "Antibodies_1month_20"  "Antibodies_1month_21"  "Antibodies_1month_22" 
## [73] "Antibodies_1month_23"  "Antibodies_1month_24"  "Antibodies_1month_25" 
## [76] "Antibodies_1month_26"  "Antibodies_1month_27"  "Antibodies_6months_1" 
## [79] "Antibodies_6months_2"  "Antibodies_6months_3"  "Antibodies_6months_4" 
## [82] "Antibodies_6months_5"  "Antibodies_6months_6"  "Antibodies_6months_7" 
## [85] "Antibodies_6months_8"  "Antibodies_6months_9"  "Antibodies_6months_10"
classLyme <- "gene"

healthyLyme <- grep('healthy',colnames(Lyme_strict))
acute <- grep('acute', colnames(Lyme_strict))
lyme1 <- grep('1month', colnames(Lyme_strict))
lyme6 <- grep('6month', colnames(Lyme_strict))

classLyme[healthyLyme] <- "healthy before lyme disease"
classLyme[acute] <- "lyme disease acute"
classLyme[lyme1] <- "lyme disease 1 month"
classLyme[lyme6] <- "lyme disease 6 months"

classLyme
##  [1] "gene"                        "healthy before lyme disease"
##  [3] "healthy before lyme disease" "healthy before lyme disease"
##  [5] "healthy before lyme disease" "healthy before lyme disease"
##  [7] "healthy before lyme disease" "healthy before lyme disease"
##  [9] "healthy before lyme disease" "healthy before lyme disease"
## [11] "healthy before lyme disease" "healthy before lyme disease"
## [13] "healthy before lyme disease" "healthy before lyme disease"
## [15] "healthy before lyme disease" "healthy before lyme disease"
## [17] "healthy before lyme disease" "healthy before lyme disease"
## [19] "healthy before lyme disease" "healthy before lyme disease"
## [21] "healthy before lyme disease" "healthy before lyme disease"
## [23] "lyme disease acute"          "lyme disease acute"         
## [25] "lyme disease acute"          "lyme disease acute"         
## [27] "lyme disease acute"          "lyme disease acute"         
## [29] "lyme disease acute"          "lyme disease acute"         
## [31] "lyme disease acute"          "lyme disease acute"         
## [33] "lyme disease acute"          "lyme disease acute"         
## [35] "lyme disease acute"          "lyme disease acute"         
## [37] "lyme disease acute"          "lyme disease acute"         
## [39] "lyme disease acute"          "lyme disease acute"         
## [41] "lyme disease acute"          "lyme disease acute"         
## [43] "lyme disease acute"          "lyme disease acute"         
## [45] "lyme disease acute"          "lyme disease acute"         
## [47] "lyme disease acute"          "lyme disease acute"         
## [49] "lyme disease acute"          "lyme disease acute"         
## [51] "lyme disease 1 month"        "lyme disease 1 month"       
## [53] "lyme disease 1 month"        "lyme disease 1 month"       
## [55] "lyme disease 1 month"        "lyme disease 1 month"       
## [57] "lyme disease 1 month"        "lyme disease 1 month"       
## [59] "lyme disease 1 month"        "lyme disease 1 month"       
## [61] "lyme disease 1 month"        "lyme disease 1 month"       
## [63] "lyme disease 1 month"        "lyme disease 1 month"       
## [65] "lyme disease 1 month"        "lyme disease 1 month"       
## [67] "lyme disease 1 month"        "lyme disease 1 month"       
## [69] "lyme disease 1 month"        "lyme disease 1 month"       
## [71] "lyme disease 1 month"        "lyme disease 1 month"       
## [73] "lyme disease 1 month"        "lyme disease 1 month"       
## [75] "lyme disease 1 month"        "lyme disease 1 month"       
## [77] "lyme disease 1 month"        "lyme disease 6 months"      
## [79] "lyme disease 6 months"       "lyme disease 6 months"      
## [81] "lyme disease 6 months"       "lyme disease 6 months"      
## [83] "lyme disease 6 months"       "lyme disease 6 months"      
## [85] "lyme disease 6 months"       "lyme disease 6 months"      
## [87] "lyme disease 6 months"

Let's look at which of the relational genes each dataset contains, to see which are common to all of these pathologies.

CFS_strict$gene_name
##  [1] "CCL20" "IL6"   "DTL"   "KIF11" "CCNA2" "ASPM"  "OLR1"  "FFAR2" "GPR84"
## [10] "IL1A"
FM_strict$gene_name
## [1] "FFAR2" "IL6"   "KIF11" "DTL"   "CCNA2"
Lyme_strict$Gene
##  [1] "OLR1"     "IL1A"     "CCL20"    "IL6"      "KIF11"    "FFAR2"   
##  [7] "GPR84"    "CCNA2"    "ASPM"     "DTL"      "HIST1H3B"
monoEBV_strict$gene
##  [1] "KIF11"    "ASPM"     "CCNA2"    "HIST1H3B" "DTL"      "FFAR2"   
##  [7] "IRG1"     "GPR84"    "IL6"      "CCL20"    "OLR1"     "IL1A"
UL_strict$GeneSymbol
##  [1] "ASPM"     "DTL"      "IL1A"     "CCL20"    "CCNA2"    "HIST1H3B"
##  [7] "IL6"      "KIF11"    "OLR1"     "GPR84"    "FFAR2"

It looks like fibromyalgia is the limiting dataset: only 5 of the relational genes are present there, and all 5 also appear in the other four datasets, so they can be used in predicting the class of each sample. (We redid the relational gene set because 3 gene symbols had been misread between fonts, for example a digit 1 read as the letter I.)

genes4 <- FM_strict$gene_name

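We are using the 5 genes in the fibromyalgia (FM) data, all of which are common to the other datasets. The object names keep the "4" suffix from the earlier run of this analysis, which had 4 genes.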
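As a cross-check on the gene lists printed above, the shared genes can also be computed instead of read off by eye. This is a minimal sketch that assumes the five symbol vectors are typed in exactly as printed; `gene_sets` and `common_genes` are names made up for this example.

```r
# Gene symbols as printed above for each filtered ("strict") dataset.
gene_sets <- list(
  CFS  = c("CCL20", "IL6", "DTL", "KIF11", "CCNA2", "ASPM", "OLR1",
           "FFAR2", "GPR84", "IL1A"),
  FM   = c("FFAR2", "IL6", "KIF11", "DTL", "CCNA2"),
  Lyme = c("OLR1", "IL1A", "CCL20", "IL6", "KIF11", "FFAR2", "GPR84",
           "CCNA2", "ASPM", "DTL", "HIST1H3B"),
  mono = c("KIF11", "ASPM", "CCNA2", "HIST1H3B", "DTL", "FFAR2", "IRG1",
           "GPR84", "IL6", "CCL20", "OLR1", "IL1A"),
  UL   = c("ASPM", "DTL", "IL1A", "CCL20", "CCNA2", "HIST1H3B", "IL6",
           "KIF11", "OLR1", "GPR84", "FFAR2")
)

# Fold intersect() over the list to keep only genes present in every set,
# then sort so the result is already in alphabetical feature order.
common_genes <- sort(Reduce(intersect, gene_sets))
common_genes
```

In the notebook itself the same call could take the real columns, for example `Reduce(intersect, list(CFS_strict$gene_name, FM_strict$gene_name, Lyme_strict$Gene, monoEBV_strict$gene, UL_strict$GeneSymbol))`.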

CFS4 <- CFS_strict[which(CFS_strict$gene_name %in% genes4),]
mono4 <- monoEBV_strict[which(monoEBV_strict$gene %in% genes4),]
Lyme4 <- Lyme_strict[which(Lyme_strict$Gene %in% genes4),]
UL4 <- UL_strict[which(UL_strict$GeneSymbol %in% genes4),]

Let's build a matrix for each of these datasets and add the class feature we just made for each one.

CFS4_t <- data.frame(t(CFS4[,2:169]))
colnames(CFS4_t) <- CFS4$gene_name
CFS4_t$class <- classCFS[2:length(classCFS)]

paged_table(CFS4_t[1:10,])

*** This is where the code had to be corrected to accommodate the gene added after the error was found. Now that we are working with 1 more gene, the column indices that put the features in alphabetical order have to be redone.

CFS4_t2 <- CFS4_t[,c(4,2,5,1,3,6)]
colnames(CFS4_t2)
## [1] "CCNA2" "DTL"   "FFAR2" "IL6"   "KIF11" "class"
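The hard-coded column indices have to be rewritten whenever the gene list changes. One possible alternative is to order the columns by name. This is only a sketch: `order_gene_columns` is a hypothetical helper, and `toy` is a made-up stand-in for `CFS4_t` with invented values.

```r
# Hypothetical helper: put gene columns in alphabetical order and keep the
# class column last, without relying on hand-written column positions.
order_gene_columns <- function(df, class_col = "class") {
  gene_cols <- sort(setdiff(colnames(df), class_col))
  df[, c(gene_cols, class_col), drop = FALSE]
}

# Toy stand-in for CFS4_t (invented values, same starting column order).
toy <- data.frame(
  IL6   = c(1.2, 0.8),
  DTL   = c(3.1, 2.9),
  KIF11 = c(4.0, 4.4),
  CCNA2 = c(2.2, 2.5),
  FFAR2 = c(7.5, 6.9),
  class = c("Chronic Fatigue Syndrome", "healthy CFS")
)

toy_sorted <- order_gene_columns(toy)
colnames(toy_sorted)
```

The same helper would apply unchanged to the FM, Lyme, UL, and mono matrices, since it only depends on the column names.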

The above is the chronic fatigue syndrome matrix; the next is fibromyalgia.

FM4_t <- data.frame(t(FM_strict[,2:13]))
colnames(FM4_t) <- FM_strict$gene_name
FM4_t$class <- classFM[2:length(classFM)]

paged_table(FM4_t)
FM4_t2 <- FM4_t[,c(5,4,1,2,3,6)]
colnames(FM4_t2)
## [1] "CCNA2" "DTL"   "FFAR2" "IL6"   "KIF11" "class"

Now for the Lyme disease data matrix. We just made the CFS and FM matrices and alphabetized their gene features.

Lyme4_t <- data.frame(t(Lyme4[,2:87]))
colnames(Lyme4_t) <- Lyme4$Gene
Lyme4_t$class <- classLyme[2:length(classLyme)]

paged_table(Lyme4_t[1:10,])
Lyme4_t2 <- Lyme4_t[,c(4,5,3,1,2,6)]

colnames(Lyme4_t2)
## [1] "CCNA2" "DTL"   "FFAR2" "IL6"   "KIF11" "class"

Next will be the UL matrix

UL4_t <- data.frame(t(UL4[,2:19]))
colnames(UL4_t) <- UL4$GeneSymbol
UL4_t$class <- classUL[2:length(classUL)]

paged_table(UL4_t[1:10,])
UL4_t2 <- UL4_t[,c(2,1,5,3,4,6)]

colnames(UL4_t2)
## [1] "CCNA2" "DTL"   "FFAR2" "IL6"   "KIF11" "class"

Next will be the last matrix of the mono and EBV genes.

mono4_t <- data.frame(t(mono4[,2:18]))
colnames(mono4_t) <- mono4$gene
mono4_t$class <- classMono[2:length(classMono)]

paged_table(mono4_t[1:10,])
mono4_t2 <- mono4_t[,c(2,3,4,5,1,6)]

colnames(mono4_t2)
## [1] "CCNA2" "DTL"   "FFAR2" "IL6"   "KIF11" "class"

Let's row-bind all these samples together, now that every matrix has the same gene features and a class column.

matrix5sets <- rbind(mono4_t2,FM4_t2,CFS4_t2,UL4_t2,Lyme4_t2)
# 301 X 6

paged_table(matrix5sets[c(1:10,50:75,100:125),])
write.csv(matrix5sets,'matrix5genes.csv', row.names=F)
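Before stacking, it can help to confirm that every matrix really has the same columns in the same order. The following is a small sketch of such a guard; `rbind_checked` is a hypothetical wrapper, and the tiny data frames are invented for illustration.

```r
# Hypothetical wrapper: refuse to stack data frames whose column names
# (including their order) differ from those of the first data frame.
rbind_checked <- function(...) {
  frames <- list(...)
  reference <- colnames(frames[[1]])
  same <- vapply(frames, function(d) identical(colnames(d), reference), logical(1))
  if (!all(same)) {
    stop("column names differ in data frame(s): ",
         paste(which(!same), collapse = ", "))
  }
  do.call(rbind, frames)
}

# Invented two-gene examples.
a <- data.frame(CCNA2 = 1, DTL = 2, class = "AIM")
b <- data.frame(CCNA2 = 3, DTL = 4, class = "healthy FM")
combined <- rbind_checked(a, b)
dim(combined)

# A frame whose columns are in a different order is rejected.
bad <- data.frame(DTL = 5, CCNA2 = 6, class = "CAEBV")
result <- tryCatch(rbind_checked(a, bad), error = function(e) "rejected")
result
```

With the notebook's objects the call would look like `rbind_checked(mono4_t2, FM4_t2, CFS4_t2, UL4_t2, Lyme4_t2)`.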

Now let's relabel the healthy samples from every dataset so they all share the single class name 'healthy'.

table(matrix5sets$class)
## 
##                         AIM                       CAEBV 
##                           6                           5 
##    Chronic Fatigue Syndrome                fibromyalgia 
##                          93                           7 
## healthy before lyme disease                 healthy CFS 
##                          21                          75 
##                  healthy FM          healthy mono caebv 
##                           5                           6 
##      healthy uterine tissue        lyme disease 1 month 
##                           6                          27 
##       lyme disease 6 months          lyme disease acute 
##                          10                          28 
##       UL surrounding tissue           uterine leiomyoma 
##                           6                           6
healthy5 <- grep('healthy',matrix5sets$class)

matrix5sets$class[healthy5] <- 'healthy'

table(matrix5sets$class)
## 
##                      AIM                    CAEBV Chronic Fatigue Syndrome 
##                        6                        5                       93 
##             fibromyalgia                  healthy     lyme disease 1 month 
##                        7                      113                       27 
##    lyme disease 6 months       lyme disease acute    UL surrounding tissue 
##                       10                       28                        6 
##        uterine leiomyoma 
##                        6
write.csv(matrix5sets,'matrix5sets_healthy5into1healthy_part2with5genesNot4genes.csv', row.names=F)
matrix5sets$class <- as.factor(matrix5sets$class)

set.seed(125)

inTrain <- sample(1:301, .8*301)

training <- matrix5sets[inTrain,]
testing <- matrix5sets[-inTrain,]

table(training$class)
## 
##                      AIM                    CAEBV Chronic Fatigue Syndrome 
##                        5                        5                       69 
##             fibromyalgia                  healthy     lyme disease 1 month 
##                        6                       91                       26 
##    lyme disease 6 months       lyme disease acute    UL surrounding tissue 
##                        8                       19                        6 
##        uterine leiomyoma 
##                        5
table(testing$class)
## 
##                      AIM                    CAEBV Chronic Fatigue Syndrome 
##                        1                        0                       24 
##             fibromyalgia                  healthy     lyme disease 1 month 
##                        1                       22                        1 
##    lyme disease 6 months       lyme disease acute    UL surrounding tissue 
##                        2                        9                        0 
##        uterine leiomyoma 
##                        1
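The random split above leaves the test set with no CAEBV and no UL surrounding tissue samples, so those classes cannot be checked on held-out data. A stratified split is one way to avoid that. This is a sketch on toy labels, not the split used in this analysis; `stratified_indices` is a hypothetical helper.

```r
# Hypothetical helper: draw the same fraction of row indices from every class,
# always taking at least one row per class.
stratified_indices <- function(classes, fraction) {
  by_class <- split(seq_along(classes), classes)
  picked <- lapply(by_class, function(idx) {
    n_take <- max(1, floor(fraction * length(idx)))
    idx[sample.int(length(idx), n_take)]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(125)

# Toy labels: two small classes and one larger class.
toy_class <- factor(c(rep("AIM", 6), rep("CAEBV", 5), rep("healthy", 20)))

in_train <- stratified_indices(toy_class, 0.8)
table(toy_class[in_train])   # training counts per class
table(toy_class[-in_train])  # testing counts per class
```

Because the number of rows taken is rounded down within each class, every class with at least two samples keeps at least one sample for testing.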
# The out-of-bag confusion matrix is always stored for classification forests, so no extra argument is needed.
rf1 <- randomForest(training[1:5], training$class, mtry=3, ntree=5000)

rf1$confusion
##                          AIM CAEBV Chronic Fatigue Syndrome fibromyalgia
## AIM                        5     0                        0            0
## CAEBV                      0     4                        0            0
## Chronic Fatigue Syndrome   0     0                       39            0
## fibromyalgia               0     0                        0            3
## healthy                    0     1                       29            1
## lyme disease 1 month       0     0                        0            1
## lyme disease 6 months      0     0                        0            0
## lyme disease acute         0     0                        0            0
## UL surrounding tissue      0     0                        1            0
## uterine leiomyoma          0     0                        1            0
##                          healthy lyme disease 1 month lyme disease 6 months
## AIM                            0                    0                     0
## CAEBV                          1                    0                     0
## Chronic Fatigue Syndrome      30                    0                     0
## fibromyalgia                   1                    1                     0
## healthy                       44                    9                     1
## lyme disease 1 month           8                   13                     1
## lyme disease 6 months          5                    1                     1
## lyme disease acute             7                    5                     0
## UL surrounding tissue          3                    0                     0
## uterine leiomyoma              3                    0                     0
##                          lyme disease acute UL surrounding tissue
## AIM                                       0                     0
## CAEBV                                     0                     0
## Chronic Fatigue Syndrome                  0                     0
## fibromyalgia                              1                     0
## healthy                                   5                     0
## lyme disease 1 month                      3                     0
## lyme disease 6 months                     1                     0
## lyme disease acute                        7                     0
## UL surrounding tissue                     0                     0
## uterine leiomyoma                         0                     1
##                          uterine leiomyoma class.error
## AIM                                      0   0.0000000
## CAEBV                                    0   0.2000000
## Chronic Fatigue Syndrome                 0   0.4347826
## fibromyalgia                             0   0.5000000
## healthy                                  1   0.5164835
## lyme disease 1 month                     0   0.5000000
## lyme disease 6 months                    0   0.8750000
## lyme disease acute                       0   0.6315789
## UL surrounding tissue                    2   1.0000000
## uterine leiomyoma                        0   1.0000000
prediction1 <- predict(rf1,testing)

results1 <- data.frame(predicted=prediction1, actual=testing$class)

results1
##                                      predicted                   actual
## GSM2279024_AIM                             AIM                      AIM
## GSM2279035_healthy                     healthy                  healthy
## Healthy4                               healthy                  healthy
## myo6                                   healthy             fibromyalgia
## control_3             Chronic Fatigue Syndrome                  healthy
## control_24            Chronic Fatigue Syndrome                  healthy
## case_27                                healthy Chronic Fatigue Syndrome
## control_37            Chronic Fatigue Syndrome                  healthy
## case_42               Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## case_46               Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## control_47                             healthy                  healthy
## control_51            Chronic Fatigue Syndrome                  healthy
## case_58                                healthy Chronic Fatigue Syndrome
## case_63                                healthy Chronic Fatigue Syndrome
## case_67               Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## case_140              Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## control_143           Chronic Fatigue Syndrome                  healthy
## control_146           Chronic Fatigue Syndrome                  healthy
## control_147                            healthy                  healthy
## case_148              Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## control_152                            healthy                  healthy
## case_153                               healthy Chronic Fatigue Syndrome
## case_157              Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## case_159                               healthy Chronic Fatigue Syndrome
## case_164              Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## case_173              Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## control_181           Chronic Fatigue Syndrome                  healthy
## control_185           Chronic Fatigue Syndrome                  healthy
## control_189           Chronic Fatigue Syndrome                  healthy
## control_190                            healthy                  healthy
## case_192              Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## case_198              Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## case_200              Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## control_214           Chronic Fatigue Syndrome                  healthy
## case_221                               healthy Chronic Fatigue Syndrome
## case_223                               healthy Chronic Fatigue Syndrome
## case_224              Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## case_225              Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## case_230                               healthy Chronic Fatigue Syndrome
## case_233              Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## case_245                               healthy Chronic Fatigue Syndrome
## case_254                               healthy Chronic Fatigue Syndrome
## control_255           Chronic Fatigue Syndrome                  healthy
## control_264           Chronic Fatigue Syndrome                  healthy
## control_267                            healthy                  healthy
## MyoN.549_S5_white                      healthy                  healthy
## UF.UI.13_S15_black       UL surrounding tissue        uterine leiomyoma
## healthyControl_6          lyme disease 1 month                  healthy
## healthyControl_11        lyme disease 6 months                  healthy
## acuteLymeDisease_3          lyme disease acute       lyme disease acute
## acuteLymeDisease_6        lyme disease 1 month       lyme disease acute
## acuteLymeDisease_8          lyme disease acute       lyme disease acute
## acuteLymeDisease_11       lyme disease 1 month       lyme disease acute
## acuteLymeDisease_14         lyme disease acute       lyme disease acute
## acuteLymeDisease_16       lyme disease 1 month       lyme disease acute
## acuteLymeDisease_21         lyme disease acute       lyme disease acute
## acuteLymeDisease_22         lyme disease acute       lyme disease acute
## acuteLymeDisease_28         lyme disease acute       lyme disease acute
## Antibodies_1month_7                    healthy     lyme disease 1 month
## Antibodies_6months_7                   healthy    lyme disease 6 months
## Antibodies_6months_10                  healthy    lyme disease 6 months
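A long predicted-versus-actual listing like the one above is easier to judge with a summary. Here is a minimal sketch of an accuracy summary; `summarise_accuracy` is a hypothetical helper and the labels below are toy values, not the real test set.

```r
# Hypothetical helper: overall accuracy plus accuracy within each actual class.
summarise_accuracy <- function(predicted, actual) {
  predicted <- as.character(predicted)
  actual <- as.character(actual)
  correct <- predicted == actual
  list(
    overall   = mean(correct),
    per_class = tapply(correct, actual, mean)
  )
}

# Toy labels for illustration only.
toy_pred   <- c("AIM", "healthy", "healthy", "CAEBV", "healthy")
toy_actual <- c("AIM", "healthy", "CAEBV",   "CAEBV", "AIM")

acc <- summarise_accuracy(toy_pred, toy_actual)
acc$overall
acc$per_class
```

Applied to this notebook it would be called as `summarise_accuracy(results1$predicted, results1$actual)`; `table(results1$actual, results1$predicted)` gives the matching cross-tabulation.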

**** Now let's compare mono & EBV with UL ****

UL_monoGenes <- c("ANKRD22","HIST1H3B","KIF11","FFAR2","CCNA2")

UL_monoGenes
## [1] "ANKRD22"  "HIST1H3B" "KIF11"    "FFAR2"    "CCNA2"
UL_5 <- UL_strict[which(UL_strict$GeneSymbol %in% UL_monoGenes),]
mono_5 <- monoEBV_strict[which(monoEBV_strict$gene %in% UL_monoGenes),]
UL_5
##       GeneSymbol MyoF.348_S12_white MyoF.428_S11_white MyoF.483_S8_black
## 14041      CCNA2                  2                  8                29
## 18178   HIST1H3B                  0                  1                10
## 35431      KIF11                 10                 16               116
## 54471      FFAR2                  0                  0                58
##       MyoF.526_S10_white MyoF.UI.10_S7_black MyoF.UI.13_S9_black
## 14041                  5                  16                   1
## 18178                  0                   2                   1
## 35431                 10                  23                   3
## 54471                  0                   0                   0
##       MyoN.432_S4_white MyoN.514_S2_black MyoN.549_S5_white MyoN.UI.20_S1_black
## 14041                 2                 4                 1                   5
## 18178                 1                 0                 3                   5
## 35431                 3                 6                 2                   5
## 54471                 1                 1                 0                   0
##       MyoN.UI.43_S3_black MyoN.UI.8_S6_white UF.372_S18_white UF.428_S17_white
## 14041                   9                  1                8                5
## 18178                   1                  0                6                5
## 35431                  15                  2                9               14
## 54471                   0                  0                0                0
##       UF.483_S14_black UF.526_S16_white UF.UI.13_S15_black UF.UI.23_S13_black
## 14041                4                7                  1                  4
## 18178                7                4                  3                  1
## 35431               13               15                  8                 11
## 54471                0                0                  0                  4

The ANKRD22 gene is in the mono data, but for some reason it was not added to relationalGenes in part 1 or in part 2, so it is missing from the filtered UL_strict table above.

So we will rebuild these datasets of genes in common from the full UL and mono & EBV tables, then run our machine learning model on them to classify samples across all of the classes in UL and mono & EBV.
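A quick way to catch a gene that was dropped by an earlier filter is to compare the requested symbols against what the filtered table holds. This sketch assumes the `UL_strict` symbols are typed in as printed earlier; `requested`, `ul_strict_genes`, and `missing_genes` are names invented for the example.

```r
# The five genes requested for the mono & EBV versus UL comparison.
requested <- c("ANKRD22", "HIST1H3B", "KIF11", "FFAR2", "CCNA2")

# Gene symbols printed earlier for UL_strict.
ul_strict_genes <- c("ASPM", "DTL", "IL1A", "CCL20", "CCNA2", "HIST1H3B",
                     "IL6", "KIF11", "OLR1", "GPR84", "FFAR2")

# Requested genes that the filtered table does not contain.
missing_genes <- setdiff(requested, ul_strict_genes)
missing_genes
```

Anything returned here has to be pulled from the unfiltered table instead, which is what the next step does with `UL` and `monoEBV`.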

UL5 <- UL[which(UL$GeneSymbol %in% UL_monoGenes),]

UL5
##                GeneID GeneSymbol    GeneBiotype MyoF.348_S12_white
## 14041 ENSG00000145386      CCNA2 protein_coding                  2
## 18178 ENSG00000274267   HIST1H3B protein_coding                  0
## 35355 ENSG00000152766    ANKRD22 protein_coding                  0
## 35431 ENSG00000138160      KIF11 protein_coding                 10
## 54471 ENSG00000126262      FFAR2 protein_coding                  0
##       MyoF.428_S11_white MyoF.483_S8_black MyoF.526_S10_white
## 14041                  8                29                  5
## 18178                  1                10                  0
## 35355                  2                41                  0
## 35431                 16               116                 10
## 54471                  0                58                  0
##       MyoF.UI.10_S7_black MyoF.UI.13_S9_black MyoN.432_S4_white
## 14041                  16                   1                 2
## 18178                   2                   1                 1
## 35355                   3                   0                 0
## 35431                  23                   3                 3
## 54471                   0                   0                 1
##       MyoN.514_S2_black MyoN.549_S5_white MyoN.UI.20_S1_black
## 14041                 4                 1                   5
## 18178                 0                 3                   5
## 35355                 0                 2                   1
## 35431                 6                 2                   5
## 54471                 1                 0                   0
##       MyoN.UI.43_S3_black MyoN.UI.8_S6_white UF.372_S18_white UF.428_S17_white
## 14041                   9                  1                8                5
## 18178                   1                  0                6                5
## 35355                   1                  0                0                1
## 35431                  15                  2                9               14
## 54471                   0                  0                0                0
##       UF.483_S14_black UF.526_S16_white UF.UI.13_S15_black UF.UI.23_S13_black
## 14041                4                7                  1                  4
## 18178                7                4                  3                  1
## 35355                0                1                  2                  6
## 35431               13               15                  8                 11
## 54471                0                0                  0                  4
##       normal_all_mean UF_all_mean UF_all_risk_mean normal_white_mean
## 14041       3.6666667   4.8333333        10.166667         1.3333333
## 18178       1.6666667   4.3333333         2.333333         1.3333333
## 35355       0.6666667   1.6666667         7.666667         0.6666667
## 35431       5.5000000  11.6666667        29.666667         2.3333333
## 54471       0.3333333   0.6666667         9.666667         0.3333333
##       UF_white_mean UF_risk_white_mean normal_black_mean UF_black_mean
## 14041     6.6666667          5.0000000         6.0000000      3.000000
## 18178     5.0000000          0.3333333         2.0000000      3.666667
## 35355     0.6666667          0.6666667         0.6666667      2.666667
## 35431    12.6666667         12.0000000         8.6666667     10.666667
## 54471     0.0000000          0.0000000         0.3333333      1.333333
##       UF_risk_black_mean UF_normal_all_FC UF_risk_normal_all_FC
## 14041          15.333333         1.318182              2.772727
## 18178           4.333333         2.600000              1.400000
## 35355          14.666667         2.500000             11.500000
## 35431          47.333333         2.121212              5.393939
## 54471          19.333333         2.000000             29.000000
##       UF_normal_white_FC UF_risk_white_FC UF_normal_black_FC UF_risk_black_FC
## 14041           5.000000         3.750000           0.500000         2.555556
## 18178           3.750000         0.250000           1.833333         2.166667
## 35355           1.000000         1.000000           4.000000        22.000000
## 35431           5.428571         5.142857           1.230769         5.461538
## 54471           0.000000         0.000000           4.000000        58.000000
classUL
##  [1] "gene"                   "UL surrounding tissue"  "UL surrounding tissue" 
##  [4] "UL surrounding tissue"  "UL surrounding tissue"  "UL surrounding tissue" 
##  [7] "UL surrounding tissue"  "healthy uterine tissue" "healthy uterine tissue"
## [10] "healthy uterine tissue" "healthy uterine tissue" "healthy uterine tissue"
## [13] "healthy uterine tissue" "uterine leiomyoma"      "uterine leiomyoma"     
## [16] "uterine leiomyoma"      "uterine leiomyoma"      "uterine leiomyoma"     
## [19] "uterine leiomyoma"
colnames(UL5)
##  [1] "GeneID"                "GeneSymbol"            "GeneBiotype"          
##  [4] "MyoF.348_S12_white"    "MyoF.428_S11_white"    "MyoF.483_S8_black"    
##  [7] "MyoF.526_S10_white"    "MyoF.UI.10_S7_black"   "MyoF.UI.13_S9_black"  
## [10] "MyoN.432_S4_white"     "MyoN.514_S2_black"     "MyoN.549_S5_white"    
## [13] "MyoN.UI.20_S1_black"   "MyoN.UI.43_S3_black"   "MyoN.UI.8_S6_white"   
## [16] "UF.372_S18_white"      "UF.428_S17_white"      "UF.483_S14_black"     
## [19] "UF.526_S16_white"      "UF.UI.13_S15_black"    "UF.UI.23_S13_black"   
## [22] "normal_all_mean"       "UF_all_mean"           "UF_all_risk_mean"     
## [25] "normal_white_mean"     "UF_white_mean"         "UF_risk_white_mean"   
## [28] "normal_black_mean"     "UF_black_mean"         "UF_risk_black_mean"   
## [31] "UF_normal_all_FC"      "UF_risk_normal_all_FC" "UF_normal_white_FC"   
## [34] "UF_risk_white_FC"      "UF_normal_black_FC"    "UF_risk_black_FC"
UL5_a <- UL5[,c(2,4:21)]
colnames(UL5_a)
##  [1] "GeneSymbol"          "MyoF.348_S12_white"  "MyoF.428_S11_white" 
##  [4] "MyoF.483_S8_black"   "MyoF.526_S10_white"  "MyoF.UI.10_S7_black"
##  [7] "MyoF.UI.13_S9_black" "MyoN.432_S4_white"   "MyoN.514_S2_black"  
## [10] "MyoN.549_S5_white"   "MyoN.UI.20_S1_black" "MyoN.UI.43_S3_black"
## [13] "MyoN.UI.8_S6_white"  "UF.372_S18_white"    "UF.428_S17_white"   
## [16] "UF.483_S14_black"    "UF.526_S16_white"    "UF.UI.13_S15_black" 
## [19] "UF.UI.23_S13_black"
UL5_t <- data.frame(t(UL5_a[,2:19]))

colnames(UL5_t) <- UL5_a$GeneSymbol

UL5_t$class <- classUL[2:19]

paged_table(UL5_t)
UL5_t2 <- UL5_t[,c(3,1,5,2,4,6)]

colnames(UL5_t2)
## [1] "ANKRD22"  "CCNA2"    "FFAR2"    "HIST1H3B" "KIF11"    "class"

Now do the same with our mono&EBV data to get these same 5 genes.

mono5 <- monoEBV[which(monoEBV$gene %in% UL_monoGenes),]

mono5
##          ID     gene GSM2279022_AIM GSM2279023_AIM GSM2279024_AIM
## 2     41163  ANKRD22       2.618010       2.835821       3.124850
## 44    40350    KIF11       4.366528       5.016053       5.332066
## 139   19187    CCNA2       6.089649       6.562936       6.650323
## 200   25360 HIST1H3B       4.993432       5.096159       5.210186
## 32474 62551    FFAR2       2.785898       2.625908       2.813352
##       GSM2279025_CAEBV GSM2279026_AIM GSM2279027_CAEBV GSM2279028_CAEBV
## 2             2.116486       3.088006         6.493938         4.197983
## 44            2.276617       5.826190         2.431858         1.983582
## 139           3.456825       7.208801         3.221267         2.579308
## 200           2.829063       5.446201         2.518839         1.946151
## 32474         2.935650       3.310563         7.966073         7.136373
##       GSM2279029_CAEBV GSM2279030_CAEBV GSM2279031_healthy GSM2279032_healthy
## 2             5.118928         4.327162           2.008278           2.033418
## 44            3.295993         2.566960           2.152256           1.726631
## 139           3.996350         3.490396           3.071337           2.617173
## 200           2.969806         2.849361           2.415078           2.168727
## 32474         7.195934         8.300824           7.510760           7.128594
##       GSM2279033_healthy GSM2279034_healthy GSM2279035_healthy
## 2               2.066117           2.379066           2.455447
## 44              1.783775           1.944827           1.861847
## 139             2.508872           2.747519           2.771293
## 200             2.168641           2.241686           2.189613
## 32474           7.895555           8.370027           9.220480
##       GSM2279036_healthy GSM2279037_AIM GSM2279038_AIM AIM_mean CAEBV_mean
## 2               2.984194       2.525890       2.762189 2.825794   4.450900
## 44              1.952959       4.370372       4.858228 4.961573   2.511002
## 139             3.009310       6.004854       6.457274 6.495640   3.348829
## 200             2.175338       4.046600       4.509862 4.883740   2.622644
## 32474           9.529778       4.170920       3.058326 3.127494   6.706971
##       healthy_mean FC_AIM_healthy FC_CAEBV_healthy
## 2         2.321087      1.2174445        1.9175929
## 44        1.903716      2.6062572        1.3190005
## 139       2.787584      2.3302040        1.2013375
## 200       2.226514      2.1934471        1.1779151
## 32474     8.275866      0.3779054        0.8104253
classMono
##  [1] "gene"               "AIM"                "AIM"               
##  [4] "AIM"                "CAEBV"              "AIM"               
##  [7] "CAEBV"              "CAEBV"              "CAEBV"             
## [10] "CAEBV"              "healthy mono caebv" "healthy mono caebv"
## [13] "healthy mono caebv" "healthy mono caebv" "healthy mono caebv"
## [16] "healthy mono caebv" "AIM"                "AIM"
colnames(mono5)
##  [1] "ID"                 "gene"               "GSM2279022_AIM"    
##  [4] "GSM2279023_AIM"     "GSM2279024_AIM"     "GSM2279025_CAEBV"  
##  [7] "GSM2279026_AIM"     "GSM2279027_CAEBV"   "GSM2279028_CAEBV"  
## [10] "GSM2279029_CAEBV"   "GSM2279030_CAEBV"   "GSM2279031_healthy"
## [13] "GSM2279032_healthy" "GSM2279033_healthy" "GSM2279034_healthy"
## [16] "GSM2279035_healthy" "GSM2279036_healthy" "GSM2279037_AIM"    
## [19] "GSM2279038_AIM"     "AIM_mean"           "CAEBV_mean"        
## [22] "healthy_mean"       "FC_AIM_healthy"     "FC_CAEBV_healthy"
mono5_a <- mono5[,c(2:19)]

colnames(mono5_a)
##  [1] "gene"               "GSM2279022_AIM"     "GSM2279023_AIM"    
##  [4] "GSM2279024_AIM"     "GSM2279025_CAEBV"   "GSM2279026_AIM"    
##  [7] "GSM2279027_CAEBV"   "GSM2279028_CAEBV"   "GSM2279029_CAEBV"  
## [10] "GSM2279030_CAEBV"   "GSM2279031_healthy" "GSM2279032_healthy"
## [13] "GSM2279033_healthy" "GSM2279034_healthy" "GSM2279035_healthy"
## [16] "GSM2279036_healthy" "GSM2279037_AIM"     "GSM2279038_AIM"
mono5_t <- data.frame(t(mono5_a[,2:18]))

colnames(mono5_t) <- mono5_a$gene

mono5_t$class <- classMono[2:18]

paged_table(mono5_t)
mono5_t2 <- mono5_t[,c(1,3,5,4,2,6)]

colnames(mono5_t2)
## [1] "ANKRD22"  "CCNA2"    "FFAR2"    "HIST1H3B" "KIF11"    "class"

Let's combine these into one matrix of mono, EBV, and UL samples, including the healthy samples from each dataset.

matrixMonoUL <- rbind(mono5_t2,UL5_t2)

paged_table(matrixMonoUL)
table(matrixMonoUL$class)
## 
##                    AIM                  CAEBV     healthy mono caebv 
##                      6                      5                      6 
## healthy uterine tissue  UL surrounding tissue      uterine leiomyoma 
##                      6                      6                      6
matrixMonoUL$class <- as.factor(matrixMonoUL$class)
set.seed(1267)

inTrain <- sample(1:35,.75*35)

training <- matrixMonoUL[inTrain,]

testing <- matrixMonoUL[-inTrain,]

table(training$class)
## 
##                    AIM                  CAEBV     healthy mono caebv 
##                      5                      4                      4 
## healthy uterine tissue  UL surrounding tissue      uterine leiomyoma 
##                      4                      4                      5
table(testing$class)
## 
##                    AIM                  CAEBV     healthy mono caebv 
##                      1                      1                      2 
## healthy uterine tissue  UL surrounding tissue      uterine leiomyoma 
##                      2                      2                      1
# The out-of-bag confusion matrix is always stored for classification forests, so no extra argument is needed.
rf <- randomForest(training[1:5], training$class, mtry=3, ntree=5000)

rf$confusion
##                        AIM CAEBV healthy mono caebv healthy uterine tissue
## AIM                      5     0                  0                      0
## CAEBV                    0     3                  1                      0
## healthy mono caebv       0     0                  4                      0
## healthy uterine tissue   0     0                  0                      3
## UL surrounding tissue    0     1                  0                      0
## uterine leiomyoma        0     0                  0                      0
##                        UL surrounding tissue uterine leiomyoma class.error
## AIM                                        0                 0        0.00
## CAEBV                                      0                 0        0.25
## healthy mono caebv                         0                 0        0.00
## healthy uterine tissue                     0                 1        0.25
## UL surrounding tissue                      3                 0        0.25
## uterine leiomyoma                          0                 5        0.00

Overall, across AIM, CAEBV, uterine leiomyoma, and the healthy samples from these two datasets, the model's out-of-bag accuracy on the training data was 75-100% per class.
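The per-class figures can be rolled up into one overall out-of-bag accuracy straight from the confusion matrix. This sketch copies the counts from the `rf$confusion` output above into a plain matrix; `conf`, `oob_accuracy`, and `class_error` are names invented for the example.

```r
classes <- c("AIM", "CAEBV", "healthy mono caebv",
             "healthy uterine tissue", "UL surrounding tissue",
             "uterine leiomyoma")

# Counts copied from rf$confusion above (rows = actual, columns = predicted).
conf <- matrix(c(5, 0, 0, 0, 0, 0,
                 0, 3, 1, 0, 0, 0,
                 0, 0, 4, 0, 0, 0,
                 0, 0, 0, 3, 0, 1,
                 0, 1, 0, 0, 3, 0,
                 0, 0, 0, 0, 0, 5),
               nrow = 6, byrow = TRUE,
               dimnames = list(classes, classes))

# Correct predictions sit on the diagonal.
oob_accuracy <- sum(diag(conf)) / sum(conf)
class_error  <- 1 - diag(conf) / rowSums(conf)

oob_accuracy
class_error
```

On the live object, the last column of `rf$confusion` is `class.error`, so it would need to be dropped first, for example with `rf$confusion[, -ncol(rf$confusion)]`.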

predict1 <- predict(rf,testing)

results <- data.frame(predicted=predict1, actual=testing$class)

results
##                                  predicted                 actual
## GSM2279023_AIM                         AIM                    AIM
## GSM2279025_CAEBV        healthy mono caebv                  CAEBV
## GSM2279033_healthy      healthy mono caebv     healthy mono caebv
## GSM2279034_healthy      healthy mono caebv     healthy mono caebv
## MyoF.UI.10_S7_black  UL surrounding tissue  UL surrounding tissue
## MyoF.UI.13_S9_black healthy uterine tissue  UL surrounding tissue
## MyoN.549_S5_white   healthy uterine tissue healthy uterine tissue
## MyoN.UI.43_S3_black  UL surrounding tissue healthy uterine tissue
## UF.UI.23_S13_black   UL surrounding tissue      uterine leiomyoma

For prediction accuracy, 4 of the 9 test samples were misclassified. On closer inspection, every error stays within its own dataset: one sample of tissue adjacent to the tumor was classified as healthy uterine tissue and one healthy uterine sample as adjacent tissue, one tumor sample was classified as tissue adjacent to the tumor, and one chronic active EBV sample was classified as a healthy sample from that same dataset.
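The point that every error stays inside its own dataset can be checked by mapping each class label to the study it came from. This is a sketch using the nine predicted and actual labels copied from the `results` table; `dataset_of` is a hypothetical lookup written for this example.

```r
# Predicted and actual labels copied from the results table above.
predicted <- c("AIM", "healthy mono caebv", "healthy mono caebv",
               "healthy mono caebv", "UL surrounding tissue",
               "healthy uterine tissue", "healthy uterine tissue",
               "UL surrounding tissue", "UL surrounding tissue")
actual <- c("AIM", "CAEBV", "healthy mono caebv",
            "healthy mono caebv", "UL surrounding tissue",
            "UL surrounding tissue", "healthy uterine tissue",
            "healthy uterine tissue", "uterine leiomyoma")

# Hypothetical lookup: the study each class label belongs to.
dataset_of <- c("AIM" = "mono/EBV",
                "CAEBV" = "mono/EBV",
                "healthy mono caebv" = "mono/EBV",
                "healthy uterine tissue" = "UL",
                "UL surrounding tissue" = "UL",
                "uterine leiomyoma" = "UL")

# Share of samples with the exact class right, and with the dataset right.
exact_match   <- mean(predicted == actual)
dataset_match <- mean(dataset_of[predicted] == dataset_of[actual])

exact_match
dataset_match
```

One caution when reading the dataset-level result: the UL values shown above are integer counts while the mono & EBV values are on a smaller continuous scale, so part of the separation between datasets may come from the measurement scale rather than from biology.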

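With only 35 samples, we had to use 75% of the data to train and tested the model on the remaining 25%. The model predicted class much better with two datasets than with the 5 datasets on the out-of-bag training samples (23 of 26 correct versus 116 of 240). On the held-out test sets the gap was smaller (5 of 9 correct versus 29 of 61).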

We could look further into chronic fatigue syndrome and fibromyalgia, maybe later. We still want to discover some relationships in our EBV-associated pathologies of lymphomas and gastrointestinal tract diseases.