In Tableau, we reviewed the top genes from our non-EBV-associated pathologies and compared them with the top genes from several studies. In each study we selected the top genes by fold change and tested their predictive accuracy, which was better than 70%, and in some cases better than 90%, at classifying samples into their respective classes within that study.
We now want to compare these genes, whose fold-change relationships we viewed, against acute infectious mononucleosis (AIM) and chronic active Epstein-Barr virus (CAEBV). We compared 12 genes that showed positive or negative correlations, meaning how large the gene expression fold change of each pathology was relative to its baseline and whether it moved in the same or the opposite direction. Here are those genes and what we found.
MT1G — very up regulated positive correlation with AIM & CAEBV, Fibromyalgia (FM) and chronic fatigue syndrome (CFS), slightly up regulated in Lyme disease (LD)
PRIM1 — very up regulated in FM & LD, slightly up regulated in CFS & CAEBV
TNFAIP6 — very down regulated in AIM, FM, LD, and uterine leiomyoma (UL), but up regulated in CFS, and slightly up in CAEBV
ANKRD22 — very upregulated in CAEBV & UL, slightly up in AIM, down regulated in CFS, LD, & Autism
ASPM — very up in AIM & UL, slight up in CAEBV, CFS, & LD
HISTIH3B — very up in AIM and UL, slight up in CAEBV & LD
OLR1 — very down regulated in AIM & CAEBV & LD, and very up regulated in UL & CFS
IRG1 — not in any sample
KIF11 — very up in AIM & CAEBV & UL & FM, slight up in CFS
ILG — very down in AIM & CAEBV & FM, slight down in LD, and slight up in CFS
ILIA — very down in AIM & CAEBV, slight down in CFS & LD
DTL — very up in AIM & FM, slight up in UL & LD & CFS
FFAR2 — very down in AIM & CAEBV & LD & CFS, but very up in FM & UL
GPR84 — very down in AIM, slight down in CAEBV & CFS, slight up in UL
CCNA2 — very up in AIM & UL & FM, slight up in LD & CFS
CCL20 — very down in AIM & CAEBV & LD, but very up in CFS
Let's make a character vector of these genes to pull from the datasets.
relationalGenes <- c("ASPM","HISTIH3B","OLR1","IRG1","KIF11",
"ILG", "ILIA", "DTL", "FFAR2", "GPR84",
"CCNA2", "CCL20")
relationalGenes
## [1] "ASPM" "HISTIH3B" "OLR1" "IRG1" "KIF11" "ILG"
## [7] "ILIA" "DTL" "FFAR2" "GPR84" "CCNA2" "CCL20"
Let's load some packages.
library(rmarkdown)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
You can retrieve these data sets here at these links:
Let's read in the mono and EBV dataset first.
pathMono <- "path to CAEBV_genes_32670_FCs.csv"
setwd(pathMono)
monoEBV <- read.csv("CAEBV_genes_32670_FCs.csv") #32670 X 24
paged_table(monoEBV[1:10,])
Now let's read in the fibromyalgia dataset, followed by the other non-EBV pathology datasets: chronic fatigue syndrome, Lyme disease, and uterine leiomyoma.
pathFM <- "path to GeneSymbols_FM_FCs_filtered.csv"
setwd(pathFM)
FM <- read.csv("GeneSymbols_FM_FCs_filtered.csv") # 20142 X 17
paged_table(FM[1:10,])
Now read in the chronic fatigue syndrome dataset.
pathCFS <- "path to CFS_data_filtered_ordered_GSE293840.csv"
setwd(pathCFS)
CFS <- read.csv("CFS_data_filtered_ordered_GSE293840.csv") # 39378 X 174
paged_table(CFS[1:10,])
Let's read in the Lyme disease dataset.
pathLyme <- "path to LymeDiseaseNormalizedFCsMeansAdded_June4th2026_ABS_min-x.csv"
setwd(pathLyme)
Lyme <- read.csv("LymeDiseaseNormalizedFCsMeansAdded_June4th2026_ABS_min-x.csv") #19526 X 95
paged_table(Lyme[1:10,])
Now we will add in the uterine leiomyoma dataset.
pathUL <- "path to UL_all_FCs_58735_notFiltered_hasNaNs_hasINf.csv"
setwd(pathUL)
UL <- read.csv("UL_all_FCs_58735_notFiltered_hasNaNs_hasINf.csv") #58735X36
paged_table(UL[1:10,])
Now we have 5 datasets to combine, from which we will pull the genes in our relationalGenes character vector.
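Each read above calls setwd() first, which changes the working directory for the rest of the session. A minimal sketch of an alternative, assuming each file sits in a known folder, builds the full path with file.path() and reads it directly. The helper name read_fc and the small temporary demo file are hypothetical and only stand in for the real datasets.

```r
# Hypothetical helper: read a fold-change CSV from a folder without setwd().
read_fc <- function(dir, file) {
  full_path <- file.path(dir, file)
  if (!file.exists(full_path)) {
    stop("File not found: ", full_path)
  }
  read.csv(full_path, stringsAsFactors = FALSE)
}

# Demo on a small temporary file (a stand-in for the real datasets).
demo_dir <- tempdir()
demo <- data.frame(gene = c("IL6", "DTL"), sample_1 = c(1.5, 2.0))
write.csv(demo, file.path(demo_dir, "demo_FCs.csv"), row.names = FALSE)

demo_read <- read_fc(demo_dir, "demo_FCs.csv")
dim(demo_read)
```

With this pattern the working directory stays where it was, so later write.csv() calls land in a predictable place.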
=============================================================================
Part 2
6/30/2026 Tuesday 730pm
We want to understand why 12 genes showed relationships with these non-EBV-associated pathologies, yet at most 9 of them were found in the data. At the very least, the mono and EBV dataset should have contained all 12 genes, so maybe there was a typo somewhere.
relationalGenes
## [1] "ASPM" "HISTIH3B" "OLR1" "IRG1" "KIF11" "ILG"
## [7] "ILIA" "DTL" "FFAR2" "GPR84" "CCNA2" "CCL20"
The three genes not in the mono and EBV data set are HISTIH3B, ILG, and ILIA.
We scrolled through the monoEBV dataset by gene to the entries starting with I and saw that we had mistaken a 6 for a G, so ILG should be IL6, and a 1 for an I, so ILIA should be IL1A. Likewise, HISTIH3B should be HIST1H3B. Let's replace these in our relationalGenes character vector.
relationalGenes2 <- c("ASPM" , "HIST1H3B" ,"OLR1" , "IRG1" , "KIF11" , "IL6", "IL1A" , "DTL" , "FFAR2" , "GPR84" , "CCNA2" , "CCL20")
relationalGenes2
## [1] "ASPM" "HIST1H3B" "OLR1" "IRG1" "KIF11" "IL6"
## [7] "IL1A" "DTL" "FFAR2" "GPR84" "CCNA2" "CCL20"
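A quick way to catch this kind of typo without scrolling is to ask which requested symbols are absent from a dataset. The sketch below is only an illustration: the vector available is a stand-in for monoEBV$gene, filled with the 12 symbols reported for the mono and EBV subset, and typed holds the symbols as first entered.

```r
# Gene symbols as first typed (three contain typos).
typed <- c("ASPM", "HISTIH3B", "OLR1", "IRG1", "KIF11", "ILG",
           "ILIA", "DTL", "FFAR2", "GPR84", "CCNA2", "CCL20")

# Stand-in for monoEBV$gene: the symbols reported for the mono/EBV subset.
available <- c("KIF11", "ASPM", "CCNA2", "HIST1H3B", "DTL", "FFAR2",
               "IRG1", "GPR84", "IL6", "CCL20", "OLR1", "IL1A")

# Symbols requested but absent from the data are likely typos or aliases.
missing_genes <- setdiff(typed, available)
missing_genes
```

Running the same setdiff() against each dataset's gene column would also show which genes are truly absent from a study rather than misspelled.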
We only want the gene name and the actual sample columns from each dataset, so we will omit the mean and fold-change columns, as well as the Ensembl ID and any other column that is neither a sample nor the gene name.
monoEBV_strict <- monoEBV[which(monoEBV$gene %in% relationalGenes2), c(2:19)]
paged_table(monoEBV_strict) #12 of the 12 relational genes
Let's make a class vector for monoEBV.
aim <- grep("AIM",colnames(monoEBV_strict))
caebv <- grep("CAEBV",colnames(monoEBV_strict))
healthy <- grep("healthy", colnames(monoEBV_strict))
classMono <- "gene"
classMono[aim] <- "AIM"
classMono[caebv] <- "CAEBV"
classMono[healthy] <- "healthy mono caebv"
colnames(monoEBV_strict)
## [1] "gene" "GSM2279022_AIM" "GSM2279023_AIM"
## [4] "GSM2279024_AIM" "GSM2279025_CAEBV" "GSM2279026_AIM"
## [7] "GSM2279027_CAEBV" "GSM2279028_CAEBV" "GSM2279029_CAEBV"
## [10] "GSM2279030_CAEBV" "GSM2279031_healthy" "GSM2279032_healthy"
## [13] "GSM2279033_healthy" "GSM2279034_healthy" "GSM2279035_healthy"
## [16] "GSM2279036_healthy" "GSM2279037_AIM" "GSM2279038_AIM"
classMono
## [1] "gene" "AIM" "AIM"
## [4] "AIM" "CAEBV" "AIM"
## [7] "CAEBV" "CAEBV" "CAEBV"
## [10] "CAEBV" "healthy mono caebv" "healthy mono caebv"
## [13] "healthy mono caebv" "healthy mono caebv" "healthy mono caebv"
## [16] "healthy mono caebv" "AIM" "AIM"
Those match without rearranging the samples by type.
Let's now get the fibromyalgia dataset samples.
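As a hedged alternative to assigning labels by grep() index, the class can be derived directly from the sample names. This sketch assumes the GSM-prefixed naming pattern shown in the output above and uses only a few of those column names; the object names cols, sample_cols, and labels are made up for the example.

```r
# A few column names copied from the monoEBV_strict output.
cols <- c("gene", "GSM2279022_AIM", "GSM2279025_CAEBV", "GSM2279031_healthy")

# Drop the gene column, then strip the "GSM<digits>_" prefix to get the label.
sample_cols <- setdiff(cols, "gene")
labels <- sub("^GSM[0-9]+_", "", sample_cols)

# Keep the study-specific healthy label used in this analysis.
labels[labels == "healthy"] <- "healthy mono caebv"
names(labels) <- sample_cols
labels
```

Because the labels are named by sample, they stay attached to the right column even if the columns are reordered later.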
colnames(FM)
## [1] "gene_id" "gene_name" "Healthy1" "Healthy2"
## [5] "Healthy3" "Healthy4" "Healthy5" "myo1"
## [9] "myo2" "myo3" "myo4" "myo5"
## [13] "myo6" "myo7" "healthy_Mean" "myo_Mean"
## [17] "FC_myo_healthy"
FM_strict <- FM[which(FM$gene_name %in% relationalGenes2),c(2:14)]
paged_table(FM_strict) #5 of the 12 relational genes
colnames(FM_strict)
## [1] "gene_name" "Healthy1" "Healthy2" "Healthy3" "Healthy4" "Healthy5"
## [7] "myo1" "myo2" "myo3" "myo4" "myo5" "myo6"
## [13] "myo7"
classFM <- "gene"
healthyFM <- grep("Healthy", colnames(FM_strict))
fibromyalgia <- grep("myo",colnames(FM_strict))
classFM[healthyFM] <- 'healthy FM'
classFM[fibromyalgia] <- 'fibromyalgia'
classFM
## [1] "gene" "healthy FM" "healthy FM" "healthy FM" "healthy FM"
## [6] "healthy FM" "fibromyalgia" "fibromyalgia" "fibromyalgia" "fibromyalgia"
## [11] "fibromyalgia" "fibromyalgia" "fibromyalgia"
Let's now get the chronic fatigue syndrome data into the same format.
colnames(CFS)
## [1] "gene_id" "gene_name" "Ensembl_transcript"
## [4] "control_1" "control_2" "control_3"
## [7] "case_4" "control_5" "case_6"
## [10] "control_7" "control_8" "case_11"
## [13] "case_12" "case_13" "case_14"
## [16] "control_15" "case_16" "control_17"
## [19] "case_18" "control_21" "control_22"
## [22] "case_23" "control_24" "case_25"
## [25] "case_26" "case_27" "case_28"
## [28] "case_31" "control_32" "case_33"
## [31] "control_34" "case_35" "control_36"
## [34] "control_37" "control_38" "case_41"
## [37] "case_42" "control_43" "control_44"
## [40] "control_45" "case_46" "control_47"
## [43] "control_48" "control_51" "case_52"
## [46] "case_53" "control_54" "control_55"
## [49] "case_56" "control_57" "case_58"
## [52] "control_59" "control_60" "case_63"
## [55] "case_64" "case_65" "control_66"
## [58] "case_67" "control_68" "case_69"
## [61] "case_70" "case_71" "control_72"
## [64] "case_139" "case_140" "case_141"
## [67] "case_142" "control_143" "control_145"
## [70] "control_146" "control_147" "case_148"
## [73] "case_150" "control_151" "control_152"
## [76] "case_153" "case_154" "control_155"
## [79] "case_156" "case_157" "case_159"
## [82] "case_160" "control_161" "control_162"
## [85] "case_163" "case_164" "control_165"
## [88] "case_166" "case_167" "control_168"
## [91] "control_169" "case_170" "case_171"
## [94] "case_173" "case_174" "case_177"
## [97] "case_178" "case_179" "control_181"
## [100] "case_182" "control_183" "control_184"
## [103] "control_185" "case_186" "control_187"
## [106] "control_188" "control_189" "control_190"
## [109] "case_192" "control_193" "control_194"
## [112] "control_195" "case_196" "case_197"
## [115] "case_198" "control_199" "case_200"
## [118] "case_201" "case_202" "case_204"
## [121] "case_205" "case_206" "control_207"
## [124] "control_208" "control_209" "case_211"
## [127] "control_212" "case_213" "control_214"
## [130] "control_215" "case_219" "control_220"
## [133] "case_221" "case_222" "case_223"
## [136] "case_224" "case_225" "case_226"
## [139] "case_230" "case_231" "control_232"
## [142] "case_233" "case_235" "control_236"
## [145] "case_240" "case_241" "case_242"
## [148] "control_243" "control_244" "case_245"
## [151] "control_246" "case_247" "case_248"
## [154] "case_251" "control_252" "case_253"
## [157] "case_254" "control_255" "control_256"
## [160] "control_257" "case_258" "case_259"
## [163] "control_260" "control_264" "case_265"
## [166] "case_266" "control_267" "control_268"
## [169] "case_270" "control_271" "case_272"
## [172] "healthy_mean" "CSF_mean" "foldchange_CSF_healthy"
CFS_strict <- CFS[which(CFS$gene_name %in% relationalGenes2),c(2,4:171)]
colnames(CFS_strict) #10 genes of the 12 relational genes
## [1] "gene_name" "control_1" "control_2" "control_3" "case_4"
## [6] "control_5" "case_6" "control_7" "control_8" "case_11"
## [11] "case_12" "case_13" "case_14" "control_15" "case_16"
## [16] "control_17" "case_18" "control_21" "control_22" "case_23"
## [21] "control_24" "case_25" "case_26" "case_27" "case_28"
## [26] "case_31" "control_32" "case_33" "control_34" "case_35"
## [31] "control_36" "control_37" "control_38" "case_41" "case_42"
## [36] "control_43" "control_44" "control_45" "case_46" "control_47"
## [41] "control_48" "control_51" "case_52" "case_53" "control_54"
## [46] "control_55" "case_56" "control_57" "case_58" "control_59"
## [51] "control_60" "case_63" "case_64" "case_65" "control_66"
## [56] "case_67" "control_68" "case_69" "case_70" "case_71"
## [61] "control_72" "case_139" "case_140" "case_141" "case_142"
## [66] "control_143" "control_145" "control_146" "control_147" "case_148"
## [71] "case_150" "control_151" "control_152" "case_153" "case_154"
## [76] "control_155" "case_156" "case_157" "case_159" "case_160"
## [81] "control_161" "control_162" "case_163" "case_164" "control_165"
## [86] "case_166" "case_167" "control_168" "control_169" "case_170"
## [91] "case_171" "case_173" "case_174" "case_177" "case_178"
## [96] "case_179" "control_181" "case_182" "control_183" "control_184"
## [101] "control_185" "case_186" "control_187" "control_188" "control_189"
## [106] "control_190" "case_192" "control_193" "control_194" "control_195"
## [111] "case_196" "case_197" "case_198" "control_199" "case_200"
## [116] "case_201" "case_202" "case_204" "case_205" "case_206"
## [121] "control_207" "control_208" "control_209" "case_211" "control_212"
## [126] "case_213" "control_214" "control_215" "case_219" "control_220"
## [131] "case_221" "case_222" "case_223" "case_224" "case_225"
## [136] "case_226" "case_230" "case_231" "control_232" "case_233"
## [141] "case_235" "control_236" "case_240" "case_241" "case_242"
## [146] "control_243" "control_244" "case_245" "control_246" "case_247"
## [151] "case_248" "case_251" "control_252" "case_253" "case_254"
## [156] "control_255" "control_256" "control_257" "case_258" "case_259"
## [161] "control_260" "control_264" "case_265" "case_266" "control_267"
## [166] "control_268" "case_270" "control_271" "case_272"
classCFS <- "gene"
cfs <- grep('case',colnames(CFS_strict))
healthyCFS <- grep('control',colnames(CFS_strict))
classCFS[cfs] <- "Chronic Fatigue Syndrome"
classCFS[healthyCFS] <- "healthy CFS"
classCFS
## [1] "gene" "healthy CFS"
## [3] "healthy CFS" "healthy CFS"
## [5] "Chronic Fatigue Syndrome" "healthy CFS"
## [7] "Chronic Fatigue Syndrome" "healthy CFS"
## [9] "healthy CFS" "Chronic Fatigue Syndrome"
## [11] "Chronic Fatigue Syndrome" "Chronic Fatigue Syndrome"
## [13] "Chronic Fatigue Syndrome" "healthy CFS"
## [15] "Chronic Fatigue Syndrome" "healthy CFS"
## [17] "Chronic Fatigue Syndrome" "healthy CFS"
## [19] "healthy CFS" "Chronic Fatigue Syndrome"
## [21] "healthy CFS" "Chronic Fatigue Syndrome"
## [23] "Chronic Fatigue Syndrome" "Chronic Fatigue Syndrome"
## [25] "Chronic Fatigue Syndrome" "Chronic Fatigue Syndrome"
## [27] "healthy CFS" "Chronic Fatigue Syndrome"
## [29] "healthy CFS" "Chronic Fatigue Syndrome"
## [31] "healthy CFS" "healthy CFS"
## [33] "healthy CFS" "Chronic Fatigue Syndrome"
## [35] "Chronic Fatigue Syndrome" "healthy CFS"
## [37] "healthy CFS" "healthy CFS"
## [39] "Chronic Fatigue Syndrome" "healthy CFS"
## [41] "healthy CFS" "healthy CFS"
## [43] "Chronic Fatigue Syndrome" "Chronic Fatigue Syndrome"
## [45] "healthy CFS" "healthy CFS"
## [47] "Chronic Fatigue Syndrome" "healthy CFS"
## [49] "Chronic Fatigue Syndrome" "healthy CFS"
## [51] "healthy CFS" "Chronic Fatigue Syndrome"
## [53] "Chronic Fatigue Syndrome" "Chronic Fatigue Syndrome"
## [55] "healthy CFS" "Chronic Fatigue Syndrome"
## [57] "healthy CFS" "Chronic Fatigue Syndrome"
## [59] "Chronic Fatigue Syndrome" "Chronic Fatigue Syndrome"
## [61] "healthy CFS" "Chronic Fatigue Syndrome"
## [63] "Chronic Fatigue Syndrome" "Chronic Fatigue Syndrome"
## [65] "Chronic Fatigue Syndrome" "healthy CFS"
## [67] "healthy CFS" "healthy CFS"
## [69] "healthy CFS" "Chronic Fatigue Syndrome"
## [71] "Chronic Fatigue Syndrome" "healthy CFS"
## [73] "healthy CFS" "Chronic Fatigue Syndrome"
## [75] "Chronic Fatigue Syndrome" "healthy CFS"
## [77] "Chronic Fatigue Syndrome" "Chronic Fatigue Syndrome"
## [79] "Chronic Fatigue Syndrome" "Chronic Fatigue Syndrome"
## [81] "healthy CFS" "healthy CFS"
## [83] "Chronic Fatigue Syndrome" "Chronic Fatigue Syndrome"
## [85] "healthy CFS" "Chronic Fatigue Syndrome"
## [87] "Chronic Fatigue Syndrome" "healthy CFS"
## [89] "healthy CFS" "Chronic Fatigue Syndrome"
## [91] "Chronic Fatigue Syndrome" "Chronic Fatigue Syndrome"
## [93] "Chronic Fatigue Syndrome" "Chronic Fatigue Syndrome"
## [95] "Chronic Fatigue Syndrome" "Chronic Fatigue Syndrome"
## [97] "healthy CFS" "Chronic Fatigue Syndrome"
## [99] "healthy CFS" "healthy CFS"
## [101] "healthy CFS" "Chronic Fatigue Syndrome"
## [103] "healthy CFS" "healthy CFS"
## [105] "healthy CFS" "healthy CFS"
## [107] "Chronic Fatigue Syndrome" "healthy CFS"
## [109] "healthy CFS" "healthy CFS"
## [111] "Chronic Fatigue Syndrome" "Chronic Fatigue Syndrome"
## [113] "Chronic Fatigue Syndrome" "healthy CFS"
## [115] "Chronic Fatigue Syndrome" "Chronic Fatigue Syndrome"
## [117] "Chronic Fatigue Syndrome" "Chronic Fatigue Syndrome"
## [119] "Chronic Fatigue Syndrome" "Chronic Fatigue Syndrome"
## [121] "healthy CFS" "healthy CFS"
## [123] "healthy CFS" "Chronic Fatigue Syndrome"
## [125] "healthy CFS" "Chronic Fatigue Syndrome"
## [127] "healthy CFS" "healthy CFS"
## [129] "Chronic Fatigue Syndrome" "healthy CFS"
## [131] "Chronic Fatigue Syndrome" "Chronic Fatigue Syndrome"
## [133] "Chronic Fatigue Syndrome" "Chronic Fatigue Syndrome"
## [135] "Chronic Fatigue Syndrome" "Chronic Fatigue Syndrome"
## [137] "Chronic Fatigue Syndrome" "Chronic Fatigue Syndrome"
## [139] "healthy CFS" "Chronic Fatigue Syndrome"
## [141] "Chronic Fatigue Syndrome" "healthy CFS"
## [143] "Chronic Fatigue Syndrome" "Chronic Fatigue Syndrome"
## [145] "Chronic Fatigue Syndrome" "healthy CFS"
## [147] "healthy CFS" "Chronic Fatigue Syndrome"
## [149] "healthy CFS" "Chronic Fatigue Syndrome"
## [151] "Chronic Fatigue Syndrome" "Chronic Fatigue Syndrome"
## [153] "healthy CFS" "Chronic Fatigue Syndrome"
## [155] "Chronic Fatigue Syndrome" "healthy CFS"
## [157] "healthy CFS" "healthy CFS"
## [159] "Chronic Fatigue Syndrome" "Chronic Fatigue Syndrome"
## [161] "healthy CFS" "healthy CFS"
## [163] "Chronic Fatigue Syndrome" "Chronic Fatigue Syndrome"
## [165] "healthy CFS" "healthy CFS"
## [167] "Chronic Fatigue Syndrome" "healthy CFS"
## [169] "Chronic Fatigue Syndrome"
Let's apply the same process to the UL dataset.
colnames(UL)
## [1] "GeneID" "GeneSymbol" "GeneBiotype"
## [4] "MyoF.348_S12_white" "MyoF.428_S11_white" "MyoF.483_S8_black"
## [7] "MyoF.526_S10_white" "MyoF.UI.10_S7_black" "MyoF.UI.13_S9_black"
## [10] "MyoN.432_S4_white" "MyoN.514_S2_black" "MyoN.549_S5_white"
## [13] "MyoN.UI.20_S1_black" "MyoN.UI.43_S3_black" "MyoN.UI.8_S6_white"
## [16] "UF.372_S18_white" "UF.428_S17_white" "UF.483_S14_black"
## [19] "UF.526_S16_white" "UF.UI.13_S15_black" "UF.UI.23_S13_black"
## [22] "normal_all_mean" "UF_all_mean" "UF_all_risk_mean"
## [25] "normal_white_mean" "UF_white_mean" "UF_risk_white_mean"
## [28] "normal_black_mean" "UF_black_mean" "UF_risk_black_mean"
## [31] "UF_normal_all_FC" "UF_risk_normal_all_FC" "UF_normal_white_FC"
## [34] "UF_risk_white_FC" "UF_normal_black_FC" "UF_risk_black_FC"
UL_strict <- UL[which(UL$GeneSymbol %in% relationalGenes2),c(2,4:21)]
colnames(UL_strict)
## [1] "GeneSymbol" "MyoF.348_S12_white" "MyoF.428_S11_white"
## [4] "MyoF.483_S8_black" "MyoF.526_S10_white" "MyoF.UI.10_S7_black"
## [7] "MyoF.UI.13_S9_black" "MyoN.432_S4_white" "MyoN.514_S2_black"
## [10] "MyoN.549_S5_white" "MyoN.UI.20_S1_black" "MyoN.UI.43_S3_black"
## [13] "MyoN.UI.8_S6_white" "UF.372_S18_white" "UF.428_S17_white"
## [16] "UF.483_S14_black" "UF.526_S16_white" "UF.UI.13_S15_black"
## [19] "UF.UI.23_S13_black"
Note that in this study MyoF is at-risk tissue next to the uterine fibroid, MyoN is normal myometrial tissue from a different person, and UF is the uterine fibroid.
classUL <- "gene"
healthyUL <- grep("MyoN",colnames(UL_strict))
ul <- grep("UF", colnames(UL_strict))
ulRisk <- grep("MyoF", colnames(UL_strict))
classUL[healthyUL] <- 'healthy uterine tissue'
classUL[ul] <- 'uterine leiomyoma'
classUL[ulRisk] <- 'UL surrounding tissue'
classUL
## [1] "gene" "UL surrounding tissue" "UL surrounding tissue"
## [4] "UL surrounding tissue" "UL surrounding tissue" "UL surrounding tissue"
## [7] "UL surrounding tissue" "healthy uterine tissue" "healthy uterine tissue"
## [10] "healthy uterine tissue" "healthy uterine tissue" "healthy uterine tissue"
## [13] "healthy uterine tissue" "uterine leiomyoma" "uterine leiomyoma"
## [16] "uterine leiomyoma" "uterine leiomyoma" "uterine leiomyoma"
## [19] "uterine leiomyoma"
Now the Lyme disease data will be arranged like the others.
colnames(Lyme)
## [1] "Gene" "healthyControl_1"
## [3] "healthyControl_2" "healthyControl_3"
## [5] "healthyControl_4" "healthyControl_5"
## [7] "healthyControl_6" "healthyControl_7"
## [9] "healthyControl_8" "healthyControl_9"
## [11] "healthyControl_10" "healthyControl_11"
## [13] "healthyControl_12" "healthyControl_13"
## [15] "healthyControl_14" "healthyControl_15"
## [17] "healthyControl_16" "healthyControl_17"
## [19] "healthyControl_18" "healthyControl_19"
## [21] "healthyControl_20" "healthyControl_21"
## [23] "acuteLymeDisease_1" "acuteLymeDisease_2"
## [25] "acuteLymeDisease_3" "acuteLymeDisease_4"
## [27] "acuteLymeDisease_5" "acuteLymeDisease_6"
## [29] "acuteLymeDisease_7" "acuteLymeDisease_8"
## [31] "acuteLymeDisease_9" "acuteLymeDisease_10"
## [33] "acuteLymeDisease_11" "acuteLymeDisease_12"
## [35] "acuteLymeDisease_13" "acuteLymeDisease_14"
## [37] "acuteLymeDisease_15" "acuteLymeDisease_16"
## [39] "acuteLymeDisease_17" "acuteLymeDisease_18"
## [41] "acuteLymeDisease_19" "acuteLymeDisease_20"
## [43] "acuteLymeDisease_21" "acuteLymeDisease_22"
## [45] "acuteLymeDisease_23" "acuteLymeDisease_24"
## [47] "acuteLymeDisease_25" "acuteLymeDisease_26"
## [49] "acuteLymeDisease_27" "acuteLymeDisease_28"
## [51] "Antibodies_1month_1" "Antibodies_1month_2"
## [53] "Antibodies_1month_3" "Antibodies_1month_4"
## [55] "Antibodies_1month_5" "Antibodies_1month_6"
## [57] "Antibodies_1month_7" "Antibodies_1month_8"
## [59] "Antibodies_1month_9" "Antibodies_1month_10"
## [61] "Antibodies_1month_11" "Antibodies_1month_12"
## [63] "Antibodies_1month_13" "Antibodies_1month_14"
## [65] "Antibodies_1month_15" "Antibodies_1month_16"
## [67] "Antibodies_1month_17" "Antibodies_1month_18"
## [69] "Antibodies_1month_19" "Antibodies_1month_20"
## [71] "Antibodies_1month_21" "Antibodies_1month_22"
## [73] "Antibodies_1month_23" "Antibodies_1month_24"
## [75] "Antibodies_1month_25" "Antibodies_1month_26"
## [77] "Antibodies_1month_27" "Antibodies_6months_1"
## [79] "Antibodies_6months_2" "Antibodies_6months_3"
## [81] "Antibodies_6months_4" "Antibodies_6months_5"
## [83] "Antibodies_6months_6" "Antibodies_6months_7"
## [85] "Antibodies_6months_8" "Antibodies_6months_9"
## [87] "Antibodies_6months_10" "healthy_mean"
## [89] "acute_mean" "month1_mean"
## [91] "month6_mean" "foldchange_acute_healthy"
## [93] "foldchange_1month_healthy" "foldchange_6month_healthy"
## [95] "foldchange_6month_acute"
Lyme_strict <- Lyme[which(Lyme$Gene %in% relationalGenes2),c(1:87)]
colnames(Lyme_strict) #11 genes of the 12
## [1] "Gene" "healthyControl_1" "healthyControl_2"
## [4] "healthyControl_3" "healthyControl_4" "healthyControl_5"
## [7] "healthyControl_6" "healthyControl_7" "healthyControl_8"
## [10] "healthyControl_9" "healthyControl_10" "healthyControl_11"
## [13] "healthyControl_12" "healthyControl_13" "healthyControl_14"
## [16] "healthyControl_15" "healthyControl_16" "healthyControl_17"
## [19] "healthyControl_18" "healthyControl_19" "healthyControl_20"
## [22] "healthyControl_21" "acuteLymeDisease_1" "acuteLymeDisease_2"
## [25] "acuteLymeDisease_3" "acuteLymeDisease_4" "acuteLymeDisease_5"
## [28] "acuteLymeDisease_6" "acuteLymeDisease_7" "acuteLymeDisease_8"
## [31] "acuteLymeDisease_9" "acuteLymeDisease_10" "acuteLymeDisease_11"
## [34] "acuteLymeDisease_12" "acuteLymeDisease_13" "acuteLymeDisease_14"
## [37] "acuteLymeDisease_15" "acuteLymeDisease_16" "acuteLymeDisease_17"
## [40] "acuteLymeDisease_18" "acuteLymeDisease_19" "acuteLymeDisease_20"
## [43] "acuteLymeDisease_21" "acuteLymeDisease_22" "acuteLymeDisease_23"
## [46] "acuteLymeDisease_24" "acuteLymeDisease_25" "acuteLymeDisease_26"
## [49] "acuteLymeDisease_27" "acuteLymeDisease_28" "Antibodies_1month_1"
## [52] "Antibodies_1month_2" "Antibodies_1month_3" "Antibodies_1month_4"
## [55] "Antibodies_1month_5" "Antibodies_1month_6" "Antibodies_1month_7"
## [58] "Antibodies_1month_8" "Antibodies_1month_9" "Antibodies_1month_10"
## [61] "Antibodies_1month_11" "Antibodies_1month_12" "Antibodies_1month_13"
## [64] "Antibodies_1month_14" "Antibodies_1month_15" "Antibodies_1month_16"
## [67] "Antibodies_1month_17" "Antibodies_1month_18" "Antibodies_1month_19"
## [70] "Antibodies_1month_20" "Antibodies_1month_21" "Antibodies_1month_22"
## [73] "Antibodies_1month_23" "Antibodies_1month_24" "Antibodies_1month_25"
## [76] "Antibodies_1month_26" "Antibodies_1month_27" "Antibodies_6months_1"
## [79] "Antibodies_6months_2" "Antibodies_6months_3" "Antibodies_6months_4"
## [82] "Antibodies_6months_5" "Antibodies_6months_6" "Antibodies_6months_7"
## [85] "Antibodies_6months_8" "Antibodies_6months_9" "Antibodies_6months_10"
classLyme <- "gene"
healthyLyme <- grep('healthy',colnames(Lyme_strict))
acute <- grep('acute', colnames(Lyme_strict))
lyme1 <- grep('1month', colnames(Lyme_strict))
lyme6 <- grep('6month', colnames(Lyme_strict))
classLyme[healthyLyme] <- "healthy before lyme disease"
classLyme[acute] <- "lyme disease acute"
classLyme[lyme1] <- "lyme disease 1 month"
classLyme[lyme6] <- "lyme disease 6 months"
classLyme
## [1] "gene" "healthy before lyme disease"
## [3] "healthy before lyme disease" "healthy before lyme disease"
## [5] "healthy before lyme disease" "healthy before lyme disease"
## [7] "healthy before lyme disease" "healthy before lyme disease"
## [9] "healthy before lyme disease" "healthy before lyme disease"
## [11] "healthy before lyme disease" "healthy before lyme disease"
## [13] "healthy before lyme disease" "healthy before lyme disease"
## [15] "healthy before lyme disease" "healthy before lyme disease"
## [17] "healthy before lyme disease" "healthy before lyme disease"
## [19] "healthy before lyme disease" "healthy before lyme disease"
## [21] "healthy before lyme disease" "healthy before lyme disease"
## [23] "lyme disease acute" "lyme disease acute"
## [25] "lyme disease acute" "lyme disease acute"
## [27] "lyme disease acute" "lyme disease acute"
## [29] "lyme disease acute" "lyme disease acute"
## [31] "lyme disease acute" "lyme disease acute"
## [33] "lyme disease acute" "lyme disease acute"
## [35] "lyme disease acute" "lyme disease acute"
## [37] "lyme disease acute" "lyme disease acute"
## [39] "lyme disease acute" "lyme disease acute"
## [41] "lyme disease acute" "lyme disease acute"
## [43] "lyme disease acute" "lyme disease acute"
## [45] "lyme disease acute" "lyme disease acute"
## [47] "lyme disease acute" "lyme disease acute"
## [49] "lyme disease acute" "lyme disease acute"
## [51] "lyme disease 1 month" "lyme disease 1 month"
## [53] "lyme disease 1 month" "lyme disease 1 month"
## [55] "lyme disease 1 month" "lyme disease 1 month"
## [57] "lyme disease 1 month" "lyme disease 1 month"
## [59] "lyme disease 1 month" "lyme disease 1 month"
## [61] "lyme disease 1 month" "lyme disease 1 month"
## [63] "lyme disease 1 month" "lyme disease 1 month"
## [65] "lyme disease 1 month" "lyme disease 1 month"
## [67] "lyme disease 1 month" "lyme disease 1 month"
## [69] "lyme disease 1 month" "lyme disease 1 month"
## [71] "lyme disease 1 month" "lyme disease 1 month"
## [73] "lyme disease 1 month" "lyme disease 1 month"
## [75] "lyme disease 1 month" "lyme disease 1 month"
## [77] "lyme disease 1 month" "lyme disease 6 months"
## [79] "lyme disease 6 months" "lyme disease 6 months"
## [81] "lyme disease 6 months" "lyme disease 6 months"
## [83] "lyme disease 6 months" "lyme disease 6 months"
## [85] "lyme disease 6 months" "lyme disease 6 months"
## [87] "lyme disease 6 months"
Let's look at which of these genes each dataset contains, to see which are common to all of these pathologies.
CFS_strict$gene_name
## [1] "CCL20" "IL6" "DTL" "KIF11" "CCNA2" "ASPM" "OLR1" "FFAR2" "GPR84"
## [10] "IL1A"
FM_strict$gene_name
## [1] "FFAR2" "IL6" "KIF11" "DTL" "CCNA2"
Lyme_strict$Gene
## [1] "OLR1" "IL1A" "CCL20" "IL6" "KIF11" "FFAR2"
## [7] "GPR84" "CCNA2" "ASPM" "DTL" "HIST1H3B"
monoEBV_strict$gene
## [1] "KIF11" "ASPM" "CCNA2" "HIST1H3B" "DTL" "FFAR2"
## [7] "IRG1" "GPR84" "IL6" "CCL20" "OLR1" "IL1A"
UL_strict$GeneSymbol
## [1] "ASPM" "DTL" "IL1A" "CCL20" "CCNA2" "HIST1H3B"
## [7] "IL6" "KIF11" "OLR1" "GPR84" "FFAR2"
It looks like the 5 genes available in the fibromyalgia data, the most limited of the sets, can be used to predict the class of a sample. (We redid the relational gene set after finding an error in 3 genes, where the font led us to misread a 1 as an I and a 6 as a G.)
genes4 <- FM_strict$gene_name
We are using the 5 genes in the fibromyalgia (FM) data, all of which are common to the other datasets. The object names below keep the suffix 4 from the earlier run with 4 genes.
CFS4 <- CFS_strict[which(CFS_strict$gene_name %in% genes4),]
mono4 <- monoEBV_strict[which(monoEBV_strict$gene %in% genes4),]
Lyme4 <- Lyme_strict[which(Lyme_strict$Gene %in% genes4),]
UL4 <- UL_strict[which(UL_strict$GeneSymbol %in% genes4),]
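Taking the FM gene list works here because FM happens to be the smallest set and all of its genes appear elsewhere. A more general sketch, assuming nothing beyond the gene vectors printed above, intersects all five lists explicitly; the list name gene_sets is made up for the example.

```r
# Gene symbols found in each filtered dataset, copied from the output above.
gene_sets <- list(
  CFS  = c("CCL20", "IL6", "DTL", "KIF11", "CCNA2", "ASPM", "OLR1",
           "FFAR2", "GPR84", "IL1A"),
  FM   = c("FFAR2", "IL6", "KIF11", "DTL", "CCNA2"),
  Lyme = c("OLR1", "IL1A", "CCL20", "IL6", "KIF11", "FFAR2", "GPR84",
           "CCNA2", "ASPM", "DTL", "HIST1H3B"),
  mono = c("KIF11", "ASPM", "CCNA2", "HIST1H3B", "DTL", "FFAR2", "IRG1",
           "GPR84", "IL6", "CCL20", "OLR1", "IL1A"),
  UL   = c("ASPM", "DTL", "IL1A", "CCL20", "CCNA2", "HIST1H3B", "IL6",
           "KIF11", "OLR1", "GPR84", "FFAR2")
)

# Fold intersect() over the list to keep only genes present in every set.
common <- Reduce(intersect, gene_sets)
sort(common)
```

If a gene in the smallest dataset were missing from another study, this approach would drop it automatically instead of producing a matrix with mismatched columns.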
Let's make our matrices for each of these and add the class feature we just made for each one.
CFS4_t <- data.frame(t(CFS4[,2:169]))
colnames(CFS4_t) <- CFS4$gene_name
CFS4_t$class <- classCFS[2:length(classCFS)]
paged_table(CFS4_t[1:10,])
*** This is where the code had to be corrected to accommodate the new gene recovered by fixing the error. We have to redo the alphabetized ordering of the features now that we are working with one more gene.
CFS4_t2 <- CFS4_t[,c(4,2,5,1,3,6)]
colnames(CFS4_t2)
## [1] "CCNA2" "DTL" "FFAR2" "IL6" "KIF11" "class"
The above is the chronic fatigue syndrome matrix; the next will be fibromyalgia.
FM4_t <- data.frame(t(FM_strict[,2:13]))
colnames(FM4_t) <- FM_strict$gene_name
FM4_t$class <- classFM[2:length(classFM)]
paged_table(FM4_t)
FM4_t2 <- FM4_t[,c(5,4,1,2,3,6)]
colnames(FM4_t2)
## [1] "CCNA2" "DTL" "FFAR2" "IL6" "KIF11" "class"
Now for the Lyme disease data matrix. We just made the CFS and FM matrices and alphabetized the gene features.
Lyme4_t <- data.frame(t(Lyme4[,2:87]))
colnames(Lyme4_t) <- Lyme4$Gene
Lyme4_t$class <- classLyme[2:length(classLyme)]
paged_table(Lyme4_t[1:10,])
Lyme4_t2 <- Lyme4_t[,c(4,5,3,1,2,6)]
colnames(Lyme4_t2)
## [1] "CCNA2" "DTL" "FFAR2" "IL6" "KIF11" "class"
Next will be the UL matrix.
UL4_t <- data.frame(t(UL4[,2:19]))
colnames(UL4_t) <- UL4$GeneSymbol
UL4_t$class <- classUL[2:length(classUL)]
paged_table(UL4_t[1:10,])
UL4_t2 <- UL4_t[,c(2,1,5,3,4,6)]
colnames(UL4_t2)
## [1] "CCNA2" "DTL" "FFAR2" "IL6" "KIF11" "class"
Next will be the last matrix of the mono and EBV genes.
mono4_t <- data.frame(t(mono4[,2:18]))
colnames(mono4_t) <- mono4$gene
mono4_t$class <- classMono[2:length(classMono)]
paged_table(mono4_t[1:10,])
mono4_t2 <- mono4_t[,c(2,3,4,5,1,6)]
colnames(mono4_t2)
## [1] "CCNA2" "DTL" "FFAR2" "IL6" "KIF11" "class"
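Each matrix above was reordered with a hand-written index vector, which has to be redone whenever the gene set changes. A minimal sketch of a name-based alternative follows; the helper order_features and the toy data frame are hypothetical, and sort() follows the current locale's collation, which is alphabetical for these symbols.

```r
# Hypothetical helper: put gene columns in alphabetical order, class last.
order_features <- function(df, class_col = "class") {
  genes <- sort(setdiff(colnames(df), class_col))
  df[, c(genes, class_col), drop = FALSE]
}

# Toy matrix with the same five genes in an arbitrary order.
toy <- data.frame(IL6 = c(1, 2), DTL = c(3, 4), KIF11 = c(5, 6),
                  CCNA2 = c(7, 8), FFAR2 = c(9, 10),
                  class = c("healthy", "AIM"))

colnames(order_features(toy))
```

Applying one function to every dataset guarantees identical column order before rbind(), regardless of how the genes were ordered in the source files.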
Let's row bind all these samples together now that they share the same gene and class feature names.
matrix5sets <- rbind(mono4_t2,FM4_t2,CFS4_t2,UL4_t2,Lyme4_t2)
# 301 X 6
paged_table(matrix5sets[c(1:10,50:75,100:125),])
write.csv(matrix5sets,'matrix5genes.csv', row.names=F)
Now let's relabel the healthy samples so they all share the single class name healthy.
table(matrix5sets$class)
##
## AIM CAEBV
## 6 5
## Chronic Fatigue Syndrome fibromyalgia
## 93 7
## healthy before lyme disease healthy CFS
## 21 75
## healthy FM healthy mono caebv
## 5 6
## healthy uterine tissue lyme disease 1 month
## 6 27
## lyme disease 6 months lyme disease acute
## 10 28
## UL surrounding tissue uterine leiomyoma
## 6 6
healthy5 <- grep('healthy',matrix5sets$class)
matrix5sets$class[healthy5] <- 'healthy'
table(matrix5sets$class)
##
## AIM CAEBV Chronic Fatigue Syndrome
## 6 5 93
## fibromyalgia healthy lyme disease 1 month
## 7 113 27
## lyme disease 6 months lyme disease acute UL surrounding tissue
## 10 28 6
## uterine leiomyoma
## 6
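Collapsing the labels overwrites which study each healthy sample came from. One possible safeguard, shown here on a toy data frame rather than on matrix5sets, is to copy the detailed label into a separate column first. The column name class_detailed is an assumption for illustration, and any such extra column would need to be excluded from the predictors when fitting a model.

```r
# Toy class column mixing disease labels and study-specific healthy labels.
toy <- data.frame(class = c("AIM", "healthy FM", "healthy CFS",
                            "healthy uterine tissue", "fibromyalgia"),
                  stringsAsFactors = FALSE)

# Preserve the original labels, then collapse every healthy variant.
toy$class_detailed <- toy$class
toy$class[grepl("healthy", toy$class)] <- "healthy"

table(toy$class)
```

Keeping the detailed label makes it possible to check later whether misclassified healthy samples cluster within one study.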
write.csv(matrix5sets,'matrix5sets_healthy5into1healthy_part2with5genesNot4genes.csv', row.names=F)
matrix5sets$class <- as.factor(matrix5sets$class)
set.seed(125)
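# 80/20 split; floor() makes the 240-row training size explicit
inTrain <- sample(seq_len(nrow(matrix5sets)), floor(0.8 * nrow(matrix5sets)))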
training <- matrix5sets[inTrain,]
testing <- matrix5sets[-inTrain,]
table(training$class)
##
## AIM CAEBV Chronic Fatigue Syndrome
## 5 5 69
## fibromyalgia healthy lyme disease 1 month
## 6 91 26
## lyme disease 6 months lyme disease acute UL surrounding tissue
## 8 19 6
## uterine leiomyoma
## 5
table(testing$class)
##
## AIM CAEBV Chronic Fatigue Syndrome
## 1 0 24
## fibromyalgia healthy lyme disease 1 month
## 1 22 1
## lyme disease 6 months lyme disease acute UL surrounding tissue
## 2 9 0
## uterine leiomyoma
## 1
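The simple random split leaves the testing set with no CAEBV and no UL surrounding tissue samples, so those classes cannot be evaluated. A sketch of a stratified split in base R is below; it is an alternative technique, not the split used in this analysis, and the helper stratified_indices and the toy class vector are hypothetical.

```r
# Hypothetical helper: draw a fraction of row indices within each class.
stratified_indices <- function(classes, frac = 0.8) {
  # as.character() drops unused factor levels so no group is empty.
  idx_by_class <- split(seq_along(classes), as.character(classes))
  picked <- lapply(idx_by_class, function(idx) {
    n_take <- max(1, floor(frac * length(idx)))
    # Sample positions within idx so a length-one idx is handled correctly.
    idx[sample.int(length(idx), n_take)]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(125)
toy_class <- rep(c("AIM", "CAEBV", "healthy"), times = c(6, 5, 20))
in_train <- stratified_indices(toy_class)

table(toy_class[in_train])
table(toy_class[-in_train])
```

With small classes such as AIM and CAEBV, taking the floor of 80% leaves at least one sample of each class for testing, as long as the class has two or more samples.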
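# The out-of-bag confusion matrix is stored in rf1$confusion for classification forests, so no extra argument is needed
rf1 <- randomForest(training[1:5], training$class, mtry=3, ntree=5000)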
rf1$confusion
##                          AIM CAEBV Chronic Fatigue Syndrome fibromyalgia
## AIM                        5     0                        0            0
## CAEBV                      0     4                        0            0
## Chronic Fatigue Syndrome   0     0                       39            0
## fibromyalgia               0     0                        0            3
## healthy                    0     1                       29            1
## lyme disease 1 month       0     0                        0            1
## lyme disease 6 months      0     0                        0            0
## lyme disease acute         0     0                        0            0
## UL surrounding tissue      0     0                        1            0
## uterine leiomyoma          0     0                        1            0
##                          healthy lyme disease 1 month lyme disease 6 months
## AIM                            0                    0                     0
## CAEBV                          1                    0                     0
## Chronic Fatigue Syndrome      30                    0                     0
## fibromyalgia                   1                    1                     0
## healthy                       44                    9                     1
## lyme disease 1 month           8                   13                     1
## lyme disease 6 months          5                    1                     1
## lyme disease acute             7                    5                     0
## UL surrounding tissue          3                    0                     0
## uterine leiomyoma              3                    0                     0
##                          lyme disease acute UL surrounding tissue
## AIM                                       0                     0
## CAEBV                                     0                     0
## Chronic Fatigue Syndrome                  0                     0
## fibromyalgia                              1                     0
## healthy                                   5                     0
## lyme disease 1 month                      3                     0
## lyme disease 6 months                     1                     0
## lyme disease acute                        7                     0
## UL surrounding tissue                     0                     0
## uterine leiomyoma                         0                     1
##                          uterine leiomyoma class.error
## AIM                                      0   0.0000000
## CAEBV                                    0   0.2000000
## Chronic Fatigue Syndrome                 0   0.4347826
## fibromyalgia                             0   0.5000000
## healthy                                  1   0.5164835
## lyme disease 1 month                     0   0.5000000
## lyme disease 6 months                    0   0.8750000
## lyme disease acute                       0   0.6315789
## UL surrounding tissue                    2   1.0000000
## uterine leiomyoma                        0   1.0000000
prediction1 <- predict(rf1,testing)
results1 <- data.frame(predicted=prediction1, actual=testing$class)
results1
## predicted actual
## GSM2279024_AIM AIM AIM
## GSM2279035_healthy healthy healthy
## Healthy4 healthy healthy
## myo6 healthy fibromyalgia
## control_3 Chronic Fatigue Syndrome healthy
## control_24 Chronic Fatigue Syndrome healthy
## case_27 healthy Chronic Fatigue Syndrome
## control_37 Chronic Fatigue Syndrome healthy
## case_42 Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## case_46 Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## control_47 healthy healthy
## control_51 Chronic Fatigue Syndrome healthy
## case_58 healthy Chronic Fatigue Syndrome
## case_63 healthy Chronic Fatigue Syndrome
## case_67 Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## case_140 Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## control_143 Chronic Fatigue Syndrome healthy
## control_146 Chronic Fatigue Syndrome healthy
## control_147 healthy healthy
## case_148 Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## control_152 healthy healthy
## case_153 healthy Chronic Fatigue Syndrome
## case_157 Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## case_159 healthy Chronic Fatigue Syndrome
## case_164 Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## case_173 Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## control_181 Chronic Fatigue Syndrome healthy
## control_185 Chronic Fatigue Syndrome healthy
## control_189 Chronic Fatigue Syndrome healthy
## control_190 healthy healthy
## case_192 Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## case_198 Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## case_200 Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## control_214 Chronic Fatigue Syndrome healthy
## case_221 healthy Chronic Fatigue Syndrome
## case_223 healthy Chronic Fatigue Syndrome
## case_224 Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## case_225 Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## case_230 healthy Chronic Fatigue Syndrome
## case_233 Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## case_245 healthy Chronic Fatigue Syndrome
## case_254 healthy Chronic Fatigue Syndrome
## control_255 Chronic Fatigue Syndrome healthy
## control_264 Chronic Fatigue Syndrome healthy
## control_267 healthy healthy
## MyoN.549_S5_white healthy healthy
## UF.UI.13_S15_black UL surrounding tissue uterine leiomyoma
## healthyControl_6 lyme disease 1 month healthy
## healthyControl_11 lyme disease 6 months healthy
## acuteLymeDisease_3 lyme disease acute lyme disease acute
## acuteLymeDisease_6 lyme disease 1 month lyme disease acute
## acuteLymeDisease_8 lyme disease acute lyme disease acute
## acuteLymeDisease_11 lyme disease 1 month lyme disease acute
## acuteLymeDisease_14 lyme disease acute lyme disease acute
## acuteLymeDisease_16 lyme disease 1 month lyme disease acute
## acuteLymeDisease_21 lyme disease acute lyme disease acute
## acuteLymeDisease_22 lyme disease acute lyme disease acute
## acuteLymeDisease_28 lyme disease acute lyme disease acute
## Antibodies_1month_7 healthy lyme disease 1 month
## Antibodies_6months_7 healthy lyme disease 6 months
## Antibodies_6months_10 healthy lyme disease 6 months
**** Now let's compare mono & EBV with UL ****
UL_monoGenes <- c("ANKRD22","HIST1H3B","KIF11","FFAR2","CCNA2")
UL_monoGenes
## [1] "ANKRD22" "HIST1H3B" "KIF11" "FFAR2" "CCNA2"
UL_5 <- UL_strict[which(UL_strict$GeneSymbol %in% UL_monoGenes),]
mono_5 <- monoEBV_strict[which(monoEBV_strict$gene %in% UL_monoGenes),]
UL_5
## GeneSymbol MyoF.348_S12_white MyoF.428_S11_white MyoF.483_S8_black
## 14041 CCNA2 2 8 29
## 18178 HIST1H3B 0 1 10
## 35431 KIF11 10 16 116
## 54471 FFAR2 0 0 58
## MyoF.526_S10_white MyoF.UI.10_S7_black MyoF.UI.13_S9_black
## 14041 5 16 1
## 18178 0 2 1
## 35431 10 23 3
## 54471 0 0 0
## MyoN.432_S4_white MyoN.514_S2_black MyoN.549_S5_white MyoN.UI.20_S1_black
## 14041 2 4 1 5
## 18178 1 0 3 5
## 35431 3 6 2 5
## 54471 1 1 0 0
## MyoN.UI.43_S3_black MyoN.UI.8_S6_white UF.372_S18_white UF.428_S17_white
## 14041 9 1 8 5
## 18178 1 0 6 5
## 35431 15 2 9 14
## 54471 0 0 0 0
## UF.483_S14_black UF.526_S16_white UF.UI.13_S15_black UF.UI.23_S13_black
## 14041 4 7 1 4
## 18178 7 4 3 1
## 35431 13 15 8 11
## 54471 0 0 0 4
The ANKRD22 gene is in the mono data, but for some reason it was not added to relationalGenes in part 1 or in part 2. That is why the strict UL subset above shows only four of the five genes.
So we will rebuild the five-gene subsets from the full UL and mono & EBV tables, then run the machine learning model on the combined classes from both datasets.
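A dropped gene like this can be caught before subsetting by comparing the requested symbols against each table. The sketch below is only an illustration: `ul_toy`, `mono_toy`, and the `missingGenes` helper are made-up names and toy tables, not objects from this analysis.

```r
# Genes we intend to pull from both datasets (same symbols as UL_monoGenes).
wanted <- c("ANKRD22", "HIST1H3B", "KIF11", "FFAR2", "CCNA2")

# Toy stand-ins for a strict subset and a full table (hypothetical contents).
ul_toy   <- data.frame(GeneSymbol = c("CCNA2", "HIST1H3B", "KIF11", "FFAR2"))
mono_toy <- data.frame(gene = c("ANKRD22", "KIF11", "CCNA2", "HIST1H3B", "FFAR2"))

# Hypothetical helper: which requested symbols are absent from a symbol column?
missingGenes <- function(symbols, wanted) setdiff(wanted, symbols)

missing_ul   <- missingGenes(ul_toy$GeneSymbol, wanted)
missing_mono <- missingGenes(mono_toy$gene, wanted)
missing_ul    # "ANKRD22"
missing_mono  # character(0)
```

Running the same check on the real tables right after each `%in%` subset would flag a four-row result immediately.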
UL5 <- UL[which(UL$GeneSymbol %in% UL_monoGenes),]
UL5
## GeneID GeneSymbol GeneBiotype MyoF.348_S12_white
## 14041 ENSG00000145386 CCNA2 protein_coding 2
## 18178 ENSG00000274267 HIST1H3B protein_coding 0
## 35355 ENSG00000152766 ANKRD22 protein_coding 0
## 35431 ENSG00000138160 KIF11 protein_coding 10
## 54471 ENSG00000126262 FFAR2 protein_coding 0
## MyoF.428_S11_white MyoF.483_S8_black MyoF.526_S10_white
## 14041 8 29 5
## 18178 1 10 0
## 35355 2 41 0
## 35431 16 116 10
## 54471 0 58 0
## MyoF.UI.10_S7_black MyoF.UI.13_S9_black MyoN.432_S4_white
## 14041 16 1 2
## 18178 2 1 1
## 35355 3 0 0
## 35431 23 3 3
## 54471 0 0 1
## MyoN.514_S2_black MyoN.549_S5_white MyoN.UI.20_S1_black
## 14041 4 1 5
## 18178 0 3 5
## 35355 0 2 1
## 35431 6 2 5
## 54471 1 0 0
## MyoN.UI.43_S3_black MyoN.UI.8_S6_white UF.372_S18_white UF.428_S17_white
## 14041 9 1 8 5
## 18178 1 0 6 5
## 35355 1 0 0 1
## 35431 15 2 9 14
## 54471 0 0 0 0
## UF.483_S14_black UF.526_S16_white UF.UI.13_S15_black UF.UI.23_S13_black
## 14041 4 7 1 4
## 18178 7 4 3 1
## 35355 0 1 2 6
## 35431 13 15 8 11
## 54471 0 0 0 4
## normal_all_mean UF_all_mean UF_all_risk_mean normal_white_mean
## 14041 3.6666667 4.8333333 10.166667 1.3333333
## 18178 1.6666667 4.3333333 2.333333 1.3333333
## 35355 0.6666667 1.6666667 7.666667 0.6666667
## 35431 5.5000000 11.6666667 29.666667 2.3333333
## 54471 0.3333333 0.6666667 9.666667 0.3333333
## UF_white_mean UF_risk_white_mean normal_black_mean UF_black_mean
## 14041 6.6666667 5.0000000 6.0000000 3.000000
## 18178 5.0000000 0.3333333 2.0000000 3.666667
## 35355 0.6666667 0.6666667 0.6666667 2.666667
## 35431 12.6666667 12.0000000 8.6666667 10.666667
## 54471 0.0000000 0.0000000 0.3333333 1.333333
## UF_risk_black_mean UF_normal_all_FC UF_risk_normal_all_FC
## 14041 15.333333 1.318182 2.772727
## 18178 4.333333 2.600000 1.400000
## 35355 14.666667 2.500000 11.500000
## 35431 47.333333 2.121212 5.393939
## 54471 19.333333 2.000000 29.000000
## UF_normal_white_FC UF_risk_white_FC UF_normal_black_FC UF_risk_black_FC
## 14041 5.000000 3.750000 0.500000 2.555556
## 18178 3.750000 0.250000 1.833333 2.166667
## 35355 1.000000 1.000000 4.000000 22.000000
## 35431 5.428571 5.142857 1.230769 5.461538
## 54471 0.000000 0.000000 4.000000 58.000000
classUL
## [1] "gene" "UL surrounding tissue" "UL surrounding tissue"
## [4] "UL surrounding tissue" "UL surrounding tissue" "UL surrounding tissue"
## [7] "UL surrounding tissue" "healthy uterine tissue" "healthy uterine tissue"
## [10] "healthy uterine tissue" "healthy uterine tissue" "healthy uterine tissue"
## [13] "healthy uterine tissue" "uterine leiomyoma" "uterine leiomyoma"
## [16] "uterine leiomyoma" "uterine leiomyoma" "uterine leiomyoma"
## [19] "uterine leiomyoma"
colnames(UL5)
## [1] "GeneID" "GeneSymbol" "GeneBiotype"
## [4] "MyoF.348_S12_white" "MyoF.428_S11_white" "MyoF.483_S8_black"
## [7] "MyoF.526_S10_white" "MyoF.UI.10_S7_black" "MyoF.UI.13_S9_black"
## [10] "MyoN.432_S4_white" "MyoN.514_S2_black" "MyoN.549_S5_white"
## [13] "MyoN.UI.20_S1_black" "MyoN.UI.43_S3_black" "MyoN.UI.8_S6_white"
## [16] "UF.372_S18_white" "UF.428_S17_white" "UF.483_S14_black"
## [19] "UF.526_S16_white" "UF.UI.13_S15_black" "UF.UI.23_S13_black"
## [22] "normal_all_mean" "UF_all_mean" "UF_all_risk_mean"
## [25] "normal_white_mean" "UF_white_mean" "UF_risk_white_mean"
## [28] "normal_black_mean" "UF_black_mean" "UF_risk_black_mean"
## [31] "UF_normal_all_FC" "UF_risk_normal_all_FC" "UF_normal_white_FC"
## [34] "UF_risk_white_FC" "UF_normal_black_FC" "UF_risk_black_FC"
UL5_a <- UL5[,c(2,4:21)]
colnames(UL5_a)
## [1] "GeneSymbol" "MyoF.348_S12_white" "MyoF.428_S11_white"
## [4] "MyoF.483_S8_black" "MyoF.526_S10_white" "MyoF.UI.10_S7_black"
## [7] "MyoF.UI.13_S9_black" "MyoN.432_S4_white" "MyoN.514_S2_black"
## [10] "MyoN.549_S5_white" "MyoN.UI.20_S1_black" "MyoN.UI.43_S3_black"
## [13] "MyoN.UI.8_S6_white" "UF.372_S18_white" "UF.428_S17_white"
## [16] "UF.483_S14_black" "UF.526_S16_white" "UF.UI.13_S15_black"
## [19] "UF.UI.23_S13_black"
UL5_t <- data.frame(t(UL5_a[,2:19]))
colnames(UL5_t) <- UL5_a$GeneSymbol
UL5_t$class <- classUL[2:19]
paged_table(UL5_t)
UL5_t2 <- UL5_t[,c(3,1,5,2,4,6)]
colnames(UL5_t2)
## [1] "ANKRD22" "CCNA2" "FFAR2" "HIST1H3B" "KIF11" "class"
Now we do the same with the mono & EBV data to pull out the same 5 genes.
mono5 <- monoEBV[which(monoEBV$gene %in% UL_monoGenes),]
mono5
## ID gene GSM2279022_AIM GSM2279023_AIM GSM2279024_AIM
## 2 41163 ANKRD22 2.618010 2.835821 3.124850
## 44 40350 KIF11 4.366528 5.016053 5.332066
## 139 19187 CCNA2 6.089649 6.562936 6.650323
## 200 25360 HIST1H3B 4.993432 5.096159 5.210186
## 32474 62551 FFAR2 2.785898 2.625908 2.813352
## GSM2279025_CAEBV GSM2279026_AIM GSM2279027_CAEBV GSM2279028_CAEBV
## 2 2.116486 3.088006 6.493938 4.197983
## 44 2.276617 5.826190 2.431858 1.983582
## 139 3.456825 7.208801 3.221267 2.579308
## 200 2.829063 5.446201 2.518839 1.946151
## 32474 2.935650 3.310563 7.966073 7.136373
## GSM2279029_CAEBV GSM2279030_CAEBV GSM2279031_healthy GSM2279032_healthy
## 2 5.118928 4.327162 2.008278 2.033418
## 44 3.295993 2.566960 2.152256 1.726631
## 139 3.996350 3.490396 3.071337 2.617173
## 200 2.969806 2.849361 2.415078 2.168727
## 32474 7.195934 8.300824 7.510760 7.128594
## GSM2279033_healthy GSM2279034_healthy GSM2279035_healthy
## 2 2.066117 2.379066 2.455447
## 44 1.783775 1.944827 1.861847
## 139 2.508872 2.747519 2.771293
## 200 2.168641 2.241686 2.189613
## 32474 7.895555 8.370027 9.220480
## GSM2279036_healthy GSM2279037_AIM GSM2279038_AIM AIM_mean CAEBV_mean
## 2 2.984194 2.525890 2.762189 2.825794 4.450900
## 44 1.952959 4.370372 4.858228 4.961573 2.511002
## 139 3.009310 6.004854 6.457274 6.495640 3.348829
## 200 2.175338 4.046600 4.509862 4.883740 2.622644
## 32474 9.529778 4.170920 3.058326 3.127494 6.706971
## healthy_mean FC_AIM_healthy FC_CAEBV_healthy
## 2 2.321087 1.2174445 1.9175929
## 44 1.903716 2.6062572 1.3190005
## 139 2.787584 2.3302040 1.2013375
## 200 2.226514 2.1934471 1.1779151
## 32474 8.275866 0.3779054 0.8104253
classMono
## [1] "gene" "AIM" "AIM"
## [4] "AIM" "CAEBV" "AIM"
## [7] "CAEBV" "CAEBV" "CAEBV"
## [10] "CAEBV" "healthy mono caebv" "healthy mono caebv"
## [13] "healthy mono caebv" "healthy mono caebv" "healthy mono caebv"
## [16] "healthy mono caebv" "AIM" "AIM"
colnames(mono5)
## [1] "ID" "gene" "GSM2279022_AIM"
## [4] "GSM2279023_AIM" "GSM2279024_AIM" "GSM2279025_CAEBV"
## [7] "GSM2279026_AIM" "GSM2279027_CAEBV" "GSM2279028_CAEBV"
## [10] "GSM2279029_CAEBV" "GSM2279030_CAEBV" "GSM2279031_healthy"
## [13] "GSM2279032_healthy" "GSM2279033_healthy" "GSM2279034_healthy"
## [16] "GSM2279035_healthy" "GSM2279036_healthy" "GSM2279037_AIM"
## [19] "GSM2279038_AIM" "AIM_mean" "CAEBV_mean"
## [22] "healthy_mean" "FC_AIM_healthy" "FC_CAEBV_healthy"
mono5_a <- mono5[,c(2:19)]
colnames(mono5_a)
## [1] "gene" "GSM2279022_AIM" "GSM2279023_AIM"
## [4] "GSM2279024_AIM" "GSM2279025_CAEBV" "GSM2279026_AIM"
## [7] "GSM2279027_CAEBV" "GSM2279028_CAEBV" "GSM2279029_CAEBV"
## [10] "GSM2279030_CAEBV" "GSM2279031_healthy" "GSM2279032_healthy"
## [13] "GSM2279033_healthy" "GSM2279034_healthy" "GSM2279035_healthy"
## [16] "GSM2279036_healthy" "GSM2279037_AIM" "GSM2279038_AIM"
mono5_t <- data.frame(t(mono5_a[,2:18]))
colnames(mono5_t) <- mono5_a$gene
mono5_t$class <- classMono[2:18]
paged_table(mono5_t)
mono5_t2 <- mono5_t[,c(1,3,5,4,2,6)]
colnames(mono5_t2)
## [1] "ANKRD22" "CCNA2" "FFAR2" "HIST1H3B" "KIF11" "class"
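Both reshaping passes above follow the same pattern: keep the symbol and sample columns, transpose, name the columns by gene, attach the class labels, and reorder the genes. A minimal sketch of that pattern as one function follows. The name `samplesByGene` and the toy table are assumptions for illustration, and sorting the gene columns by name stands in for the hand-written index vectors (for these five symbols it gives the same order).

```r
# Hypothetical helper: turn a gene-by-sample table into a sample-by-gene
# data frame with a class column and gene columns sorted by name.
samplesByGene <- function(df, geneCol, sampleCols, classes) {
  out <- data.frame(t(df[, sampleCols, drop = FALSE]), check.names = FALSE)
  colnames(out) <- df[[geneCol]]                   # one column per gene
  out <- out[, sort(colnames(out)), drop = FALSE]  # same gene order for every dataset
  out$class <- classes                             # one label per sample row
  out
}

# Toy input (made-up values): two genes measured in three samples.
toy <- data.frame(gene = c("KIF11", "CCNA2"),
                  s1 = c(4.4, 6.1), s2 = c(5.0, 6.6), s3 = c(2.3, 3.5))
toy_t <- samplesByGene(toy, "gene", c("s1", "s2", "s3"),
                       c("AIM", "AIM", "CAEBV"))
colnames(toy_t)  # "CCNA2" "KIF11" "class"
```

Because every table comes out with identically named and ordered columns, the later `rbind()` lines up without manual reindexing.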
Let's combine these into one matrix of mono, EBV, and UL samples, keeping the healthy samples from each dataset.
matrixMonoUL <- rbind(mono5_t2,UL5_t2)
paged_table(matrixMonoUL)
table(matrixMonoUL$class)
##
## AIM CAEBV healthy mono caebv
## 6 5 6
## healthy uterine tissue UL surrounding tissue uterine leiomyoma
## 6 6 6
matrixMonoUL$class <- as.factor(matrixMonoUL$class)
set.seed(1267)
inTrain <- sample(seq_len(nrow(matrixMonoUL)), floor(0.75 * nrow(matrixMonoUL)))
training <- matrixMonoUL[inTrain,]
testing <- matrixMonoUL[-inTrain,]
table(training$class)
##
## AIM CAEBV healthy mono caebv
## 5 4 4
## healthy uterine tissue UL surrounding tissue uterine leiomyoma
## 4 4 5
table(testing$class)
##
## AIM CAEBV healthy mono caebv
## 1 1 2
## healthy uterine tissue UL surrounding tissue uterine leiomyoma
## 2 2 1
rf <- randomForest(training[, 1:5], training$class, mtry = 3, ntree = 5000)
rf$confusion
##                        AIM CAEBV healthy mono caebv healthy uterine tissue
## AIM                      5     0                  0                      0
## CAEBV                    0     3                  1                      0
## healthy mono caebv       0     0                  4                      0
## healthy uterine tissue   0     0                  0                      3
## UL surrounding tissue    0     1                  0                      0
## uterine leiomyoma        0     0                  0                      0
##                        UL surrounding tissue uterine leiomyoma class.error
## AIM                                        0                 0        0.00
## CAEBV                                      0                 0        0.25
## healthy mono caebv                         0                 0        0.00
## healthy uterine tissue                     0                 1        0.25
## UL surrounding tissue                      3                 0        0.25
## uterine leiomyoma                          0                 5        0.00
Overall, across mono (AIM), chronic active EBV, and uterine leiomyoma, together with the healthy samples from these two datasets, the out-of-bag class error on the training data ranged from 0 to 0.25. That is 75-100% accuracy per class for this model.
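The per-class figures can be recomputed from the confusion counts printed above. The sketch below retypes those counts by hand into a matrix, so `conf`, `class_accuracy`, and `oob_accuracy` are illustrative names rather than objects from the analysis.

```r
classes <- c("AIM", "CAEBV", "healthy mono caebv",
             "healthy uterine tissue", "UL surrounding tissue",
             "uterine leiomyoma")

# Counts copied from rf$confusion (rows = actual class, columns = predicted).
conf <- matrix(c(5, 0, 0, 0, 0, 0,
                 0, 3, 1, 0, 0, 0,
                 0, 0, 4, 0, 0, 0,
                 0, 0, 0, 3, 0, 1,
                 0, 1, 0, 0, 3, 0,
                 0, 0, 0, 0, 0, 5),
               nrow = 6, byrow = TRUE, dimnames = list(classes, classes))

class_accuracy <- diag(conf) / rowSums(conf)   # equals 1 - class.error
oob_accuracy   <- sum(diag(conf)) / sum(conf)  # overall out-of-bag accuracy
round(class_accuracy, 2)
round(oob_accuracy, 3)  # 0.885 (23 of 26 training samples)
```

These are out-of-bag estimates on the 26 training samples, so they are a separate check from the held-out test predictions.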
predict1 <- predict(rf,testing)
results <- data.frame(predicted=predict1, actual=testing$class)
results
## predicted actual
## GSM2279023_AIM AIM AIM
## GSM2279025_CAEBV healthy mono caebv CAEBV
## GSM2279033_healthy healthy mono caebv healthy mono caebv
## GSM2279034_healthy healthy mono caebv healthy mono caebv
## MyoF.UI.10_S7_black UL surrounding tissue UL surrounding tissue
## MyoF.UI.13_S9_black healthy uterine tissue UL surrounding tissue
## MyoN.549_S5_white healthy uterine tissue healthy uterine tissue
## MyoN.UI.43_S3_black UL surrounding tissue healthy uterine tissue
## UF.UI.23_S13_black UL surrounding tissue uterine leiomyoma
For prediction accuracy on the test set, 4 of the 9 samples were misclassified. Further inspection shows that every error stays within its own dataset. In the UL data, one sample of tissue next to the tumor was classified as healthy uterine tissue, one healthy uterine sample was classified as tissue next to the tumor, and one tumor sample was classified as the tissue adjacent to the tumor. The fourth error is a chronic active EBV sample classified as a healthy sample from that same dataset.
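The tally can be reproduced from the nine predicted and actual labels printed above. This sketch retypes those pairs into a small data frame; `results_toy`, `n_correct`, and `test_accuracy` are illustrative names only.

```r
lv <- c("AIM", "CAEBV", "healthy mono caebv", "healthy uterine tissue",
        "UL surrounding tissue", "uterine leiomyoma")

# Predicted and actual labels copied row by row from the results table.
predicted <- factor(c("AIM", "healthy mono caebv", "healthy mono caebv",
                      "healthy mono caebv", "UL surrounding tissue",
                      "healthy uterine tissue", "healthy uterine tissue",
                      "UL surrounding tissue", "UL surrounding tissue"),
                    levels = lv)
actual <- factor(c("AIM", "CAEBV", "healthy mono caebv",
                   "healthy mono caebv", "UL surrounding tissue",
                   "UL surrounding tissue", "healthy uterine tissue",
                   "healthy uterine tissue", "uterine leiomyoma"),
                 levels = lv)
results_toy <- data.frame(predicted, actual)

n_correct     <- sum(results_toy$predicted == results_toy$actual)
test_accuracy <- n_correct / nrow(results_toy)
n_correct                # 5
round(test_accuracy, 3)  # 0.556

# Cross-tabulation of actual against predicted class on the test set.
table(actual = results_toy$actual, predicted = results_toy$predicted)
```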
We had to train on 75% of the 35 samples and test the model on the remaining 25% (9 samples). The model predicted class much better when reduced to these two datasets than it did with all 5 datasets.
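With so few samples, a purely random split can leave a class out of the test set, as happened in the five-dataset model, where the testing table shows no CAEBV or UL surrounding tissue samples. One option is to sample the training fraction within each class. The sketch below is an assumption about how that could be done in base R, not the split used in this analysis; `stratifiedTrainIndex` is a hypothetical helper, and only the class labels are simulated.

```r
# Hypothetical helper: sample a fixed fraction of row indices within each class.
stratifiedTrainIndex <- function(classes, frac) {
  byClass <- split(seq_along(classes), as.character(classes))
  picked <- lapply(byClass, function(idx) {
    n <- max(1, floor(frac * length(idx)))  # keep at least one row per class
    idx[sample.int(length(idx), n)]         # index-based, safe for one-row classes
  })
  sort(unlist(picked, use.names = FALSE))
}

# Class sizes follow table(matrixMonoUL$class); the labels alone are enough here.
cls <- rep(c("AIM", "CAEBV", "healthy mono caebv", "healthy uterine tissue",
             "UL surrounding tissue", "uterine leiomyoma"),
           times = c(6, 5, 6, 6, 6, 6))

set.seed(1267)
inTrainStrat <- stratifiedTrainIndex(cls, 0.75)
table(cls[inTrainStrat])   # training counts per class
table(cls[-inTrainStrat])  # every class keeps samples for testing
```

The trade-off is a slightly smaller training set (floor of 75% per class rather than overall) in exchange for a test set that covers all six classes.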
We could dig further into Chronic Fatigue Syndrome and Fibromyalgia, maybe later. We still want to discover some relationships among our EBV-associated pathologies: the lymphomas and gastrointestinal tract diseases.