We found this study, GSE293840, on myalgic encephalomyelitis (ME), also called chronic fatigue syndrome (CFS). We reviewed the current gold-standard diagnostic criteria, along with the etiology, clinical signs and symptoms, diagnosis, and treatment options, in the 20th edition of The Merck Manual. Chronic fatigue syndrome is also currently called SEID, for systemic exertion intolerance disease. No connection or causal role has been proven for gene changes from Epstein-Barr virus (EBV), Lyme disease, cytomegalovirus, or other viral or bacterial infections. CFS resembles fibromyalgia in its cognitive decline, pain, and unknown etiology.

The criteria are as follows. The patient must have been high functioning and then, almost out of nowhere, experienced a dramatic decline in energy, usually after a mild infection, lasting more than 6 months. There is an onset of malaise or extreme tiredness after activity such as chores, errands, or getting ready to go out, and there is orthostatic intolerance. No clinical signs or laboratory findings show any changes: there is no elevated ESR or CRP, no TSH abnormality, and no other finding such as peripheral neuropathy. The labs have to be completely normal, for example normal glucose with no diabetes.

Treatment is limited to graded physical exercise and cognitive behavioural therapy, which are described as the only approaches that have improved fatigue.

In this study, GSE293840, there are 93 CFS patients and 75 healthy controls. It may not be usable in our Tableau dashboard, because the total RNA here is only the cell-free RNA (cfRNA). We will look at it now and see whether any of its genes match the mRNA genes we have collected for our EBV-associated and non-associated pathologies.

library(rmarkdown)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
data <- read.csv("GSE293840_raw_counts_all.csv.gz")

colnames(data)
##   [1] "X"             "cfs_cfrna_1"   "cfs_cfrna_2"   "cfs_cfrna_3"  
##   [5] "cfs_cfrna_4"   "cfs_cfrna_5"   "cfs_cfrna_6"   "cfs_cfrna_7"  
##   [9] "cfs_cfrna_8"   "cfs_cfrna_11"  "cfs_cfrna_12"  "cfs_cfrna_13" 
##  [13] "cfs_cfrna_14"  "cfs_cfrna_15"  "cfs_cfrna_16"  "cfs_cfrna_17" 
##  [17] "cfs_cfrna_18"  "cfs_cfrna_21"  "cfs_cfrna_22"  "cfs_cfrna_23" 
##  [21] "cfs_cfrna_24"  "cfs_cfrna_25"  "cfs_cfrna_26"  "cfs_cfrna_27" 
##  [25] "cfs_cfrna_28"  "cfs_cfrna_31"  "cfs_cfrna_32"  "cfs_cfrna_33" 
##  [29] "cfs_cfrna_34"  "cfs_cfrna_35"  "cfs_cfrna_36"  "cfs_cfrna_37" 
##  [33] "cfs_cfrna_38"  "cfs_cfrna_41"  "cfs_cfrna_42"  "cfs_cfrna_43" 
##  [37] "cfs_cfrna_44"  "cfs_cfrna_45"  "cfs_cfrna_46"  "cfs_cfrna_47" 
##  [41] "cfs_cfrna_48"  "cfs_cfrna_51"  "cfs_cfrna_52"  "cfs_cfrna_53" 
##  [45] "cfs_cfrna_54"  "cfs_cfrna_55"  "cfs_cfrna_56"  "cfs_cfrna_57" 
##  [49] "cfs_cfrna_58"  "cfs_cfrna_59"  "cfs_cfrna_60"  "cfs_cfrna_63" 
##  [53] "cfs_cfrna_64"  "cfs_cfrna_65"  "cfs_cfrna_66"  "cfs_cfrna_67" 
##  [57] "cfs_cfrna_68"  "cfs_cfrna_69"  "cfs_cfrna_70"  "cfs_cfrna_71" 
##  [61] "cfs_cfrna_72"  "cfs_cfrna_139" "cfs_cfrna_140" "cfs_cfrna_141"
##  [65] "cfs_cfrna_142" "cfs_cfrna_143" "cfs_cfrna_145" "cfs_cfrna_146"
##  [69] "cfs_cfrna_147" "cfs_cfrna_148" "cfs_cfrna_150" "cfs_cfrna_151"
##  [73] "cfs_cfrna_152" "cfs_cfrna_153" "cfs_cfrna_154" "cfs_cfrna_155"
##  [77] "cfs_cfrna_156" "cfs_cfrna_157" "cfs_cfrna_159" "cfs_cfrna_160"
##  [81] "cfs_cfrna_161" "cfs_cfrna_162" "cfs_cfrna_163" "cfs_cfrna_164"
##  [85] "cfs_cfrna_165" "cfs_cfrna_166" "cfs_cfrna_167" "cfs_cfrna_168"
##  [89] "cfs_cfrna_169" "cfs_cfrna_170" "cfs_cfrna_171" "cfs_cfrna_173"
##  [93] "cfs_cfrna_174" "cfs_cfrna_177" "cfs_cfrna_178" "cfs_cfrna_179"
##  [97] "cfs_cfrna_181" "cfs_cfrna_182" "cfs_cfrna_183" "cfs_cfrna_184"
## [101] "cfs_cfrna_185" "cfs_cfrna_186" "cfs_cfrna_187" "cfs_cfrna_188"
## [105] "cfs_cfrna_189" "cfs_cfrna_190" "cfs_cfrna_192" "cfs_cfrna_193"
## [109] "cfs_cfrna_194" "cfs_cfrna_195" "cfs_cfrna_196" "cfs_cfrna_197"
## [113] "cfs_cfrna_198" "cfs_cfrna_199" "cfs_cfrna_200" "cfs_cfrna_201"
## [117] "cfs_cfrna_202" "cfs_cfrna_204" "cfs_cfrna_205" "cfs_cfrna_206"
## [121] "cfs_cfrna_207" "cfs_cfrna_208" "cfs_cfrna_209" "cfs_cfrna_211"
## [125] "cfs_cfrna_212" "cfs_cfrna_213" "cfs_cfrna_214" "cfs_cfrna_215"
## [129] "cfs_cfrna_219" "cfs_cfrna_220" "cfs_cfrna_221" "cfs_cfrna_222"
## [133] "cfs_cfrna_223" "cfs_cfrna_224" "cfs_cfrna_225" "cfs_cfrna_226"
## [137] "cfs_cfrna_230" "cfs_cfrna_231" "cfs_cfrna_232" "cfs_cfrna_233"
## [141] "cfs_cfrna_235" "cfs_cfrna_236" "cfs_cfrna_240" "cfs_cfrna_241"
## [145] "cfs_cfrna_242" "cfs_cfrna_243" "cfs_cfrna_244" "cfs_cfrna_245"
## [149] "cfs_cfrna_246" "cfs_cfrna_247" "cfs_cfrna_248" "cfs_cfrna_251"
## [153] "cfs_cfrna_252" "cfs_cfrna_253" "cfs_cfrna_254" "cfs_cfrna_255"
## [157] "cfs_cfrna_256" "cfs_cfrna_257" "cfs_cfrna_258" "cfs_cfrna_259"
## [161] "cfs_cfrna_260" "cfs_cfrna_264" "cfs_cfrna_265" "cfs_cfrna_266"
## [165] "cfs_cfrna_267" "cfs_cfrna_268" "cfs_cfrna_270" "cfs_cfrna_271"
## [169] "cfs_cfrna_272"

The above samples are all tagged ‘cfs’, but they are a mix of 75 healthy samples and 93 CFS samples. Each Ensembl ID in this cell-free RNA (cfRNA) data has a suffix after the stable ID: a period followed by a number. In Ensembl’s ID scheme this suffix is the version number of the ID.

We want to label each sample as healthy or CFS. The series matrix file should have this information. Alternatively, we could copy the sample table from the GSE293840 study page into Excel, read that in, and map each sample type to its ‘cfs_cfrna_’ sample ID.

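# Read the first 36 rows of the series matrix file.
series36 <- read.table("GSE293840_series_matrix_folder.txt", nrows=36)

paged_table(series36)
# Skip those 36 rows and read the next 48, which hold the per-sample fields used below.
series <- read.table("GSE293840_series_matrix_folder.txt", skip=36, nrows=48)

paged_table(series)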
phenotype <- (t(series[12,]))
colnames(phenotype) <- "class"

class <- phenotype[-1,]

class1 <- gsub("phenotype: ", "",class)

class1 <- as.factor(class1)

table(class1)
## class1
##    case control 
##      93      75

There are 93 CFS samples and 75 healthy controls. The phenotype is in the 12th row of the series table of information on this study. We will combine it with the cfs_cfRNA tag in row 20 to get the sample type.

classTag <- series[c(12,20),]

paged_table(classTag)
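Looking rows up by position (12 and 20) works for this file but would silently pick the wrong row if the layout differed. As a minimal sketch, the same two rows could be found by the text they contain. The toy table `series_toy` and the helper `find_row` are hypothetical names introduced here, and the cell values are invented stand-ins for the real file.

```r
# Toy stand-in for the series-matrix rows; the values are invented for illustration.
series_toy <- data.frame(
  V1 = c("!Sample_title", "!Sample_characteristics_ch1", "!Sample_description"),
  V2 = c("s1", "phenotype: case", "Library name: cfs_cfrna_1"),
  V3 = c("s2", "phenotype: control", "Library name: cfs_cfrna_2"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: index of the first row in which any sample column
# starts with `prefix`.
find_row <- function(df, prefix) {
  hits <- apply(df[, -1, drop = FALSE], 1,
                function(r) any(startsWith(r, prefix)))
  which(hits)[1]
}

pheno_row <- find_row(series_toy, "phenotype: ")
lib_row   <- find_row(series_toy, "Library name: ")

# Strip the prefixes and build a sample key with one row per sample.
pheno <- unname(sub("^phenotype: ", "", unlist(series_toy[pheno_row, -1])))
lib   <- unname(sub("^Library name: ", "", unlist(series_toy[lib_row, -1])))
sample_key <- data.frame(type = pheno, sampleID = lib, stringsAsFactors = FALSE)
sample_key
```

On the real `series` table the same helper would be expected to return 12 and 20, which gives a quick check that the positional indices are the intended rows.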

We don’t need the first column. We will use gsub to strip ‘phenotype:’ and ‘Library name:’ so that the data headers can be combined with the sample type.

classTag_t <- data.frame(t(classTag[,-1]))
colnames(classTag_t) <- c("type","sampleID")

headers <- data.frame(type = classTag_t$type, sampleID = classTag_t$sampleID,IDs = colnames(data[,-1]))

paged_table(headers)
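The `headers` table pairs the metadata with the count columns purely by position, which assumes both files list the samples in the same order. A small sketch of an order-independent alternative uses `match()` on the sample ID. The objects `meta_toy` and `count_cols` are made-up examples, deliberately given in different orders.

```r
# Toy metadata and count-column names, deliberately in different orders.
meta_toy <- data.frame(
  type = c("case", "control", "case"),
  sampleID = c("cfs_cfrna_2", "cfs_cfrna_1", "cfs_cfrna_3"),
  stringsAsFactors = FALSE
)
count_cols <- c("cfs_cfrna_1", "cfs_cfrna_2", "cfs_cfrna_3")

# Reorder the metadata so that row i describes count column i.
idx <- match(count_cols, meta_toy$sampleID)
stopifnot(!anyNA(idx))  # every count column must have a metadata row
meta_aligned <- meta_toy[idx, ]

# Build labels such as "control_1" from the aligned metadata.
labels_toy <- paste(meta_aligned$type,
                    sub("^cfs_cfrna_", "", meta_aligned$sampleID),
                    sep = "_")
labels_toy
```

If the two orders already agree, `idx` is simply `1:n` and nothing changes, so the check costs little.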

Let’s combine these with the paste function after modifying the type and IDs columns to remove the ‘phenotype: ’ prefix and the prepended ‘cfs_cfrna_’ name.

headers$type <- gsub('phenotype: ','', headers$type)
headers$IDs <- gsub('cfs_cfrna_','',headers$IDs)

headers$labels <- paste(headers$type, headers$IDs, sep='_')

paged_table(headers)
headers$class <- as.factor(class1)

paged_table(headers)
headers$class <- gsub('case','Chronic Fatigue Syndrome',headers$class)
headers$class <- gsub('control',"healthy",headers$class)

paged_table(headers)
table(headers$class)
## 
## Chronic Fatigue Syndrome                  healthy 
##                       93                       75

Let’s rename the Ensembl ID column from ‘X’ to ‘Ensembl_transcript’. We will later strip the appended version suffix with a regex to get an Ensembl_gene field.

colnames(data)[1] <- 'Ensembl_transcript'

colnames(data)[2:169] <- headers$labels

paged_table(data[1:10,])

Now we use a regex to remove the period and everything after it in the Ensembl_transcript column, creating an Ensembl_gene column that we can merge with gene symbols from another Ensembl dataset of gene names. That dataset has around 51,000 genes, fewer than the 60,708 here.

data$Ensembl_gene <- gsub("\\..*","",data$Ensembl_transcript)

paged_table(data[c(1:10,60700:60708),c(1,2,170)])
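To make the regex step concrete, here is a small sketch on three example IDs. The stable IDs are real Ensembl gene IDs, but the version numbers after the period are illustrative and not taken from this dataset. The pattern is anchored to a trailing period plus digits, a slightly stricter variant of the `"\\..*"` pattern used above.

```r
# Example IDs: two with a version suffix, one without (suffixes are illustrative).
ids <- c("ENSG00000141510.18", "ENSG00000012048.23", "ENSG00000157764")

# Remove a trailing ".<digits>" only; IDs without a suffix pass through unchanged.
gene_ids <- sub("\\.[0-9]+$", "", ids)
gene_ids

# After stripping, check whether any two rows collapsed to the same gene ID.
n_duplicated <- sum(duplicated(gene_ids))
n_duplicated
```

A non-zero `n_duplicated` on the real data would mean several rows share one gene ID, which matters for the later merge.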

Let’s start getting the fold change values of these genes by taking the mean across controls and across cases, that is, the healthy and CFS samples.

controls <- grep('control',colnames(data))
cases <- grep('case',colnames(data))

There are 93 cases and 75 controls, as expected, so we can use these indices to get the row means by sample type. (The column named CSF_mean in the code holds the CFS mean.)

data$healthy_mean <- rowMeans(data[,controls])
data$CSF_mean <- rowMeans(data[,cases])

colnames(data)
##   [1] "Ensembl_transcript" "control_1"          "control_2"         
##   [4] "control_3"          "case_4"             "control_5"         
##   [7] "case_6"             "control_7"          "control_8"         
##  [10] "case_11"            "case_12"            "case_13"           
##  [13] "case_14"            "control_15"         "case_16"           
##  [16] "control_17"         "case_18"            "control_21"        
##  [19] "control_22"         "case_23"            "control_24"        
##  [22] "case_25"            "case_26"            "case_27"           
##  [25] "case_28"            "case_31"            "control_32"        
##  [28] "case_33"            "control_34"         "case_35"           
##  [31] "control_36"         "control_37"         "control_38"        
##  [34] "case_41"            "case_42"            "control_43"        
##  [37] "control_44"         "control_45"         "case_46"           
##  [40] "control_47"         "control_48"         "control_51"        
##  [43] "case_52"            "case_53"            "control_54"        
##  [46] "control_55"         "case_56"            "control_57"        
##  [49] "case_58"            "control_59"         "control_60"        
##  [52] "case_63"            "case_64"            "case_65"           
##  [55] "control_66"         "case_67"            "control_68"        
##  [58] "case_69"            "case_70"            "case_71"           
##  [61] "control_72"         "case_139"           "case_140"          
##  [64] "case_141"           "case_142"           "control_143"       
##  [67] "control_145"        "control_146"        "control_147"       
##  [70] "case_148"           "case_150"           "control_151"       
##  [73] "control_152"        "case_153"           "case_154"          
##  [76] "control_155"        "case_156"           "case_157"          
##  [79] "case_159"           "case_160"           "control_161"       
##  [82] "control_162"        "case_163"           "case_164"          
##  [85] "control_165"        "case_166"           "case_167"          
##  [88] "control_168"        "control_169"        "case_170"          
##  [91] "case_171"           "case_173"           "case_174"          
##  [94] "case_177"           "case_178"           "case_179"          
##  [97] "control_181"        "case_182"           "control_183"       
## [100] "control_184"        "control_185"        "case_186"          
## [103] "control_187"        "control_188"        "control_189"       
## [106] "control_190"        "case_192"           "control_193"       
## [109] "control_194"        "control_195"        "case_196"          
## [112] "case_197"           "case_198"           "control_199"       
## [115] "case_200"           "case_201"           "case_202"          
## [118] "case_204"           "case_205"           "case_206"          
## [121] "control_207"        "control_208"        "control_209"       
## [124] "case_211"           "control_212"        "case_213"          
## [127] "control_214"        "control_215"        "case_219"          
## [130] "control_220"        "case_221"           "case_222"          
## [133] "case_223"           "case_224"           "case_225"          
## [136] "case_226"           "case_230"           "case_231"          
## [139] "control_232"        "case_233"           "case_235"          
## [142] "control_236"        "case_240"           "case_241"          
## [145] "case_242"           "control_243"        "control_244"       
## [148] "case_245"           "control_246"        "case_247"          
## [151] "case_248"           "case_251"           "control_252"       
## [154] "case_253"           "case_254"           "control_255"       
## [157] "control_256"        "control_257"        "case_258"          
## [160] "case_259"           "control_260"        "control_264"       
## [163] "case_265"           "case_266"           "control_267"       
## [166] "control_268"        "case_270"           "control_271"       
## [169] "case_272"           "Ensembl_gene"       "healthy_mean"      
## [172] "CSF_mean"

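Now we get the fold change values. Note that the code divides the healthy mean by the CFS mean, so a value above 1 means the gene has higher mean counts in healthy samples than in CFS samples.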

data$foldchange_CSF_healthy <- data$healthy_mean/data$CSF_mean

Data <- data[order(data$foldchange_CSF_healthy, decreasing=T),]
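A plain ratio of means produces `NaN` when both means are zero and `Inf` when only the denominator is zero. As a hedged sketch of a common alternative (not the method used in this analysis), a log2 fold change with a pseudocount stays finite for every gene. The toy vectors and the pseudocount of 1 are assumptions chosen for illustration, and this sketch puts CFS in the numerator.

```r
# Toy group means for four genes.
healthy_mean <- c(10, 0, 0, 4)
cfs_mean     <- c(5, 0, 3, 0)

# Plain ratio as in the analysis (healthy over CFS): 2, NaN, 0, Inf.
ratio <- healthy_mean / cfs_mean
ratio

# Alternative: log2 fold change of CFS over healthy with a pseudocount.
pseudo <- 1  # assumed value; it damps ratios for low-count genes
log2fc <- log2((cfs_mean + pseudo) / (healthy_mean + pseudo))
log2fc
```

With this form, positive values mean higher in CFS and negative values mean lower in CFS, and no rows need to be dropped for being infinite or undefined.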

Let’s remove the infinite values and NaNs and keep only values above 0.

DataInf <- Data[!is.infinite(Data$foldchange_CSF_healthy),]
summary(DataInf$foldchange_CSF_healthy) #58,093X173
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.     NAs 
##  0.0000  0.6200  0.8964  0.9754  1.1321 54.5600    7985

There are 7,985 NaNs to remove.

DataNaN <- DataInf[!is.na(DataInf$foldchange_CSF_healthy),]
summary(DataNaN$foldchange_CSF_healthy) #50,108X173
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.6200  0.8964  0.9754  1.1321 54.5600

No more NaNs, but we still have 0s. Let’s remove those from the data.

DataZeros <- DataNaN[DataNaN$foldchange_CSF_healthy > 0,]
summary(DataZeros$foldchange_CSF_healthy) #45,067X173
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.02313  0.71185  0.93208  1.08446  1.16967 54.56000
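The three filtering steps above can also be written as one logical mask. This is a small sketch on a toy vector `fc` (a made-up example); `is.finite()` is `FALSE` for `Inf`, `NaN`, and `NA`, so a single condition covers all three cases.

```r
# Toy fold change values containing every problem case.
fc <- c(2, NaN, 0, Inf, 0.5, NA)

# Keep values that are finite and strictly positive.
keep <- is.finite(fc) & fc > 0
fc[keep]
```

On a data frame the same mask would be used as a row index, for example `Data[keep, ]`, with `keep` computed from the fold change column.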

Next we merge the remaining genes with another dataset to get gene symbols.

path <- 'path to your Ensembl dataset to merge by gene symbols'

Here is the link to the Ensembl data with gene symbols, which we will probably use a lot.

setwd(path)

ensembl <- read.csv("GSE271486_ensembleIDs_NPC_LBMP_study.csv")
paged_table(ensembl[1:5,1:5])
ensembl2 <- ensembl[,1:2]

colnames(ensembl2) #50,868X2
## [1] "gene_id"   "gene_name"

Let’s retrieve the gene symbol by merging the ensembl2 dataset, by gene_id, to the Ensembl_gene column of DataZeros.

DATA <- merge(ensembl2, DataZeros, by.x='gene_id',by.y='Ensembl_gene')

colnames(DATA) #39,378X174
##   [1] "gene_id"                "gene_name"              "Ensembl_transcript"    
##   [4] "control_1"              "control_2"              "control_3"             
##   [7] "case_4"                 "control_5"              "case_6"                
##  [10] "control_7"              "control_8"              "case_11"               
##  [13] "case_12"                "case_13"                "case_14"               
##  [16] "control_15"             "case_16"                "control_17"            
##  [19] "case_18"                "control_21"             "control_22"            
##  [22] "case_23"                "control_24"             "case_25"               
##  [25] "case_26"                "case_27"                "case_28"               
##  [28] "case_31"                "control_32"             "case_33"               
##  [31] "control_34"             "case_35"                "control_36"            
##  [34] "control_37"             "control_38"             "case_41"               
##  [37] "case_42"                "control_43"             "control_44"            
##  [40] "control_45"             "case_46"                "control_47"            
##  [43] "control_48"             "control_51"             "case_52"               
##  [46] "case_53"                "control_54"             "control_55"            
##  [49] "case_56"                "control_57"             "case_58"               
##  [52] "control_59"             "control_60"             "case_63"               
##  [55] "case_64"                "case_65"                "control_66"            
##  [58] "case_67"                "control_68"             "case_69"               
##  [61] "case_70"                "case_71"                "control_72"            
##  [64] "case_139"               "case_140"               "case_141"              
##  [67] "case_142"               "control_143"            "control_145"           
##  [70] "control_146"            "control_147"            "case_148"              
##  [73] "case_150"               "control_151"            "control_152"           
##  [76] "case_153"               "case_154"               "control_155"           
##  [79] "case_156"               "case_157"               "case_159"              
##  [82] "case_160"               "control_161"            "control_162"           
##  [85] "case_163"               "case_164"               "control_165"           
##  [88] "case_166"               "case_167"               "control_168"           
##  [91] "control_169"            "case_170"               "case_171"              
##  [94] "case_173"               "case_174"               "case_177"              
##  [97] "case_178"               "case_179"               "control_181"           
## [100] "case_182"               "control_183"            "control_184"           
## [103] "control_185"            "case_186"               "control_187"           
## [106] "control_188"            "control_189"            "control_190"           
## [109] "case_192"               "control_193"            "control_194"           
## [112] "control_195"            "case_196"               "case_197"              
## [115] "case_198"               "control_199"            "case_200"              
## [118] "case_201"               "case_202"               "case_204"              
## [121] "case_205"               "case_206"               "control_207"           
## [124] "control_208"            "control_209"            "case_211"              
## [127] "control_212"            "case_213"               "control_214"           
## [130] "control_215"            "case_219"               "control_220"           
## [133] "case_221"               "case_222"               "case_223"              
## [136] "case_224"               "case_225"               "case_226"              
## [139] "case_230"               "case_231"               "control_232"           
## [142] "case_233"               "case_235"               "control_236"           
## [145] "case_240"               "case_241"               "case_242"              
## [148] "control_243"            "control_244"            "case_245"              
## [151] "control_246"            "case_247"               "case_248"              
## [154] "case_251"               "control_252"            "case_253"              
## [157] "case_254"               "control_255"            "control_256"           
## [160] "control_257"            "case_258"               "case_259"              
## [163] "control_260"            "control_264"            "case_265"              
## [166] "case_266"               "control_267"            "control_268"           
## [169] "case_270"               "control_271"            "case_272"              
## [172] "healthy_mean"           "CSF_mean"               "foldchange_CSF_healthy"
DataOrdered <- DATA[order(DATA$foldchange_CSF_healthy, decreasing=T),]

paged_table(DataOrdered[c(1:10,39369:39378),c(1:4,174)])
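`merge()` performs an inner join by default, so rows whose gene ID is missing from the symbol table are dropped; that is why 45,067 rows go in and 39,378 come out. The sketch below shows the effect on toy tables (`symbols` and `counts` are invented examples) and how `all.y = TRUE` would keep the unmatched rows instead.

```r
# Toy symbol table and toy fold change table; ENSG3 has no symbol.
symbols <- data.frame(gene_id = c("ENSG1", "ENSG2"),
                      gene_name = c("A", "B"),
                      stringsAsFactors = FALSE)
counts <- data.frame(Ensembl_gene = c("ENSG1", "ENSG2", "ENSG3"),
                     foldchange = c(2, 0.5, 1.2),
                     stringsAsFactors = FALSE)

# Default inner join: only IDs present in both tables survive.
inner <- merge(symbols, counts, by.x = "gene_id", by.y = "Ensembl_gene")

# Keep every row of `counts`; unmatched rows get NA for gene_name.
keep_all <- merge(symbols, counts, by.x = "gene_id", by.y = "Ensembl_gene",
                  all.y = TRUE)

n_dropped <- nrow(counts) - nrow(inner)
n_dropped
```

Keeping the unmatched rows makes it possible to inspect which IDs lack a symbol before deciding to discard them.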

Write out to csv to add to Tableau dashboard on CFS.

write.csv(DataOrdered,'CFS_data_filtered_ordered_GSE293840.csv',row.names=F)

This data can be found at this link.

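Next we will take the 10 genes with the highest fold change and the 10 with the lowest, and test them with random forest classification to see how well they predict the class as CFS or healthy. Because the ratio is healthy over CFS, the highest values are genes with lower mean counts in CFS and the lowest values are genes with higher mean counts in CFS.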

topGenesCFS <- DataOrdered[c(1:10,39369:39378),]

paged_table(topGenesCFS)
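The row indices `1:10` and `39369:39378` are tied to the current row count. A minimal sketch of the same selection with `head()` and `tail()`, shown on a toy table (`df` and `n` are illustrative names), keeps working if the number of rows changes.

```r
# Toy table of eight genes with fold change values.
df <- data.frame(gene = letters[1:8],
                 fc = c(0.2, 5, 1, 0.9, 3, 0.1, 1.1, 2),
                 stringsAsFactors = FALSE)

# Order from highest to lowest fold change.
ordered <- df[order(df$fc, decreasing = TRUE), ]

# Take the n highest and n lowest rows without hard-coded indices.
n <- 2
top_bottom <- rbind(head(ordered, n), tail(ordered, n))
top_bottom
```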

Let’s make our matrix to test these genes.

colnames(topGenesCFS)
##   [1] "gene_id"                "gene_name"              "Ensembl_transcript"    
##   [4] "control_1"              "control_2"              "control_3"             
##   [7] "case_4"                 "control_5"              "case_6"                
##  [10] "control_7"              "control_8"              "case_11"               
##  [13] "case_12"                "case_13"                "case_14"               
##  [16] "control_15"             "case_16"                "control_17"            
##  [19] "case_18"                "control_21"             "control_22"            
##  [22] "case_23"                "control_24"             "case_25"               
##  [25] "case_26"                "case_27"                "case_28"               
##  [28] "case_31"                "control_32"             "case_33"               
##  [31] "control_34"             "case_35"                "control_36"            
##  [34] "control_37"             "control_38"             "case_41"               
##  [37] "case_42"                "control_43"             "control_44"            
##  [40] "control_45"             "case_46"                "control_47"            
##  [43] "control_48"             "control_51"             "case_52"               
##  [46] "case_53"                "control_54"             "control_55"            
##  [49] "case_56"                "control_57"             "case_58"               
##  [52] "control_59"             "control_60"             "case_63"               
##  [55] "case_64"                "case_65"                "control_66"            
##  [58] "case_67"                "control_68"             "case_69"               
##  [61] "case_70"                "case_71"                "control_72"            
##  [64] "case_139"               "case_140"               "case_141"              
##  [67] "case_142"               "control_143"            "control_145"           
##  [70] "control_146"            "control_147"            "case_148"              
##  [73] "case_150"               "control_151"            "control_152"           
##  [76] "case_153"               "case_154"               "control_155"           
##  [79] "case_156"               "case_157"               "case_159"              
##  [82] "case_160"               "control_161"            "control_162"           
##  [85] "case_163"               "case_164"               "control_165"           
##  [88] "case_166"               "case_167"               "control_168"           
##  [91] "control_169"            "case_170"               "case_171"              
##  [94] "case_173"               "case_174"               "case_177"              
##  [97] "case_178"               "case_179"               "control_181"           
## [100] "case_182"               "control_183"            "control_184"           
## [103] "control_185"            "case_186"               "control_187"           
## [106] "control_188"            "control_189"            "control_190"           
## [109] "case_192"               "control_193"            "control_194"           
## [112] "control_195"            "case_196"               "case_197"              
## [115] "case_198"               "control_199"            "case_200"              
## [118] "case_201"               "case_202"               "case_204"              
## [121] "case_205"               "case_206"               "control_207"           
## [124] "control_208"            "control_209"            "case_211"              
## [127] "control_212"            "case_213"               "control_214"           
## [130] "control_215"            "case_219"               "control_220"           
## [133] "case_221"               "case_222"               "case_223"              
## [136] "case_224"               "case_225"               "case_226"              
## [139] "case_230"               "case_231"               "control_232"           
## [142] "case_233"               "case_235"               "control_236"           
## [145] "case_240"               "case_241"               "case_242"              
## [148] "control_243"            "control_244"            "case_245"              
## [151] "control_246"            "case_247"               "case_248"              
## [154] "case_251"               "control_252"            "case_253"              
## [157] "case_254"               "control_255"            "control_256"           
## [160] "control_257"            "case_258"               "case_259"              
## [163] "control_260"            "control_264"            "case_265"              
## [166] "case_266"               "control_267"            "control_268"           
## [169] "case_270"               "control_271"            "case_272"              
## [172] "healthy_mean"           "CSF_mean"               "foldchange_CSF_healthy"

Let’s make a matrix from columns 4 through 171, which hold the samples.

length(class1)
## [1] 168
table(class1)
## class1
##    case control 
##      93      75
CFS_mx <- data.frame(t(topGenesCFS[,4:171]))

colnames(CFS_mx) <- topGenesCFS$gene_name

CFS_mx$class <- as.factor(headers$class)

paged_table(CFS_mx)

There are 168 samples. We will randomly split them 80/20 into training and testing sets, holding out 20% of the samples to validate how well our random forest classifier predicts each sample’s class.

set.seed(567)

inTrain <- sample(1:168,.8*168)

training <- CFS_mx[inTrain,]
testing <- CFS_mx[-inTrain,]

table(training$class)
## 
## Chronic Fatigue Syndrome                  healthy 
##                       74                       60
table(testing$class)
## 
## Chronic Fatigue Syndrome                  healthy 
##                       19                       15
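The simple random split above happened to preserve the class proportions. As a sketch of how to guarantee that, a stratified split samples 80% within each class separately. The factor `class_toy` is a stand-in with the same class sizes as this study, and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Stand-in class labels with the study's class sizes.
class_toy <- factor(rep(c("CFS", "healthy"), times = c(93, 75)))

# Sample 80% of the indices within each class, rounding down.
idx_by_class <- split(seq_along(class_toy), class_toy)
in_train <- unlist(lapply(idx_by_class,
                          function(i) sample(i, floor(0.8 * length(i)))),
                   use.names = FALSE)

table(class_toy[in_train])   # training counts per class
table(class_toy[-in_train])  # held-out counts per class
```

With 93 and 75 samples this yields 74 and 60 training samples, the same counts as the split used in this analysis.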
rf_cfs <- randomForest(training[1:20], training$class, mtry=7, ntree=5000, importance=TRUE)

rf_cfs$confusion
##                          Chronic Fatigue Syndrome healthy class.error
## Chronic Fatigue Syndrome                       40      34   0.4594595
## healthy                                        39      21   0.6500000
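Per-class and overall accuracy can be read directly off this matrix, where rows are the actual classes and columns the predicted ones. The sketch below re-enters the four counts printed above by hand; the short class names are used only to keep the example compact.

```r
# Counts copied from the confusion matrix above (rows = actual, columns = predicted).
conf <- matrix(c(40, 39, 34, 21), nrow = 2,
               dimnames = list(actual = c("CFS", "healthy"),
                               predicted = c("CFS", "healthy")))

# Per-class accuracy is the diagonal over the row totals (1 - class.error).
per_class_acc <- diag(conf) / rowSums(conf)
round(per_class_acc, 3)

# Overall accuracy is the diagonal total over all samples.
overall_acc <- sum(diag(conf)) / sum(conf)
round(overall_acc, 3)
```

This gives 40/74 for CFS, 21/60 for healthy, and 61/134 overall.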

The model did not do very well on the training data: the out-of-bag confusion matrix shows about 54% accuracy for CFS and 35% for healthy. Not ideal. But the current clinical gold standard of diagnostic medicine says there are no genes that detect CFS. Let’s see how well these genes predict the validation set.

prediction_cfs <- predict(rf_cfs,testing)

results <- data.frame(predicted=prediction_cfs, actual=testing$class)

results
##                            predicted                   actual
## control_3   Chronic Fatigue Syndrome                  healthy
## case_4      Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## case_6      Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## case_16     Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## case_27     Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## case_28     Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## control_38                   healthy                  healthy
## control_47                   healthy                  healthy
## control_48  Chronic Fatigue Syndrome                  healthy
## control_54  Chronic Fatigue Syndrome                  healthy
## control_55                   healthy                  healthy
## case_65     Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## control_143                  healthy                  healthy
## control_147                  healthy                  healthy
## case_164    Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## case_166    Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## control_169 Chronic Fatigue Syndrome                  healthy
## case_170    Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## case_178    Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## control_181                  healthy                  healthy
## case_182    Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## control_184 Chronic Fatigue Syndrome                  healthy
## control_194 Chronic Fatigue Syndrome                  healthy
## control_199 Chronic Fatigue Syndrome                  healthy
## case_202                     healthy Chronic Fatigue Syndrome
## case_211    Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## control_214                  healthy                  healthy
## case_221    Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## case_224    Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## case_233    Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## case_241    Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## case_254    Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## control_257 Chronic Fatigue Syndrome                  healthy
## case_270    Chronic Fatigue Syndrome Chronic Fatigue Syndrome

There are 34 samples. Let’s see the percentage correct.

correctCFS <- results$predicted == results$actual

correctCFS
##  [1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE
## [13]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
## [25] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
sum(results$predicted == results$actual)/length(results$predicted)
## [1] 0.7352941

These genes scored 73.5% accuracy in the prediction test set.
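A single accuracy figure hides how the errors are distributed. Tallying the predictions printed above gives 18 of 19 CFS samples and 7 of 15 healthy samples classified correctly. The sketch below rebuilds that table by hand and adds an exact binomial confidence interval from `binom.test()` in base R, since 34 samples is a small test set.

```r
# Counts tallied from the results printed above (rows = actual, columns = predicted).
holdout <- matrix(c(18, 8, 1, 7), nrow = 2,
                  dimnames = list(actual = c("CFS", "healthy"),
                                  predicted = c("CFS", "healthy")))

accuracy    <- sum(diag(holdout)) / sum(holdout)            # 25 / 34
sensitivity <- holdout["CFS", "CFS"] / sum(holdout["CFS", ])            # 18 / 19
specificity <- holdout["healthy", "healthy"] / sum(holdout["healthy", ]) # 7 / 15

# Accuracy from always predicting the larger class in the test set.
majority_baseline <- max(rowSums(holdout)) / sum(holdout)   # 19 / 34

# Exact (Clopper-Pearson) 95% interval for the accuracy.
ci <- binom.test(sum(diag(holdout)), sum(holdout))$conf.int
round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, baseline = majority_baseline), 3)
ci
```

The model labels 26 of the 34 samples as CFS, so most of the accuracy comes from the CFS class, and the interval shows how much the figure could move with a different 34 samples.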

rf_cfs$importance
##            Chronic Fatigue Syndrome       healthy MeanDecreaseAccuracy
## GNAT3                 -0.0050650902 -7.422873e-03        -5.997699e-03
## AC027288.2             0.0020244968 -2.913264e-03        -1.203892e-04
## Z82196.2               0.0089375213  1.655641e-02         1.224256e-02
## C14orf39               0.0233805554  2.612085e-02         2.434775e-02
## KRTAP19-1             -0.0001339432  3.429759e-05        -5.666362e-05
## CU104787.1            -0.0012834498 -1.689599e-03        -1.484912e-03
## C11orf53               0.0019605824  7.414992e-04         1.343862e-03
## AADACL3                0.0033467671  9.596638e-03         6.118076e-03
## AC097467.2            -0.0002117360  1.784217e-04        -4.318021e-05
## NBEAP3                -0.0005716373 -1.517900e-03        -9.509907e-04
## AC079336.1            -0.0010177135  5.683652e-03         1.778447e-03
## AL034428.1            -0.0009783046 -3.016368e-03        -1.949751e-03
## RPL13AP19              0.0017545578  1.931587e-02         9.214935e-03
## CSN3                  -0.0031400198  8.381599e-04        -1.442605e-03
## FOXCUT                -0.0020523671  1.664360e-02         5.624609e-03
## CSE1L-AS1              0.0001820924  1.960613e-02         8.239398e-03
## AL031717.1            -0.0037435249 -7.021918e-04        -2.721394e-03
## CSN2                  -0.0002992672 -1.854009e-03        -1.077673e-03
## AC009560.1             0.0021382956  1.827984e-02         8.735792e-03
## EYS                   -0.0141633193 -9.358656e-03        -1.230496e-02
##            MeanDecreaseGini
## GNAT3             1.1048733
## AC027288.2        2.2957537
## Z82196.2          2.6263247
## C14orf39          3.7571650
## KRTAP19-1         0.1874902
## CU104787.1        1.6964044
## C11orf53          0.6346821
## AADACL3           1.6879317
## AC097467.2        0.3097577
## NBEAP3            0.6834908
## AC079336.1        1.9825080
## AL034428.1        0.8926938
## RPL13AP19         2.4992189
## CSN3              0.5926663
## FOXCUT            2.2683159
## CSE1L-AS1         2.4553511
## AL031717.1        1.5267586
## CSN2              1.4721648
## AC009560.1        2.5911821
## EYS               2.8215956
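The importance table is easier to read when sorted. As a small sketch, the matrix below re-enters four of the rows printed above and ranks them by MeanDecreaseAccuracy; on the fitted model the same ordering could be applied to `rf_cfs$importance` directly.

```r
# Four rows copied from the importance output above.
imp <- matrix(c(0.02434775, 0.01224256, -0.01230496, -0.005997699,
                3.7571650, 2.6263247, 2.8215956, 1.1048733),
              ncol = 2,
              dimnames = list(c("C14orf39", "Z82196.2", "EYS", "GNAT3"),
                              c("MeanDecreaseAccuracy", "MeanDecreaseGini")))

# Rank genes from most to least helpful by permutation importance.
ranked <- imp[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), ]
ranked

# Genes whose permutation did not reduce (or even improved) out-of-bag accuracy.
negative <- rownames(imp)[imp[, "MeanDecreaseAccuracy"] < 0]
negative
```

A negative MeanDecreaseAccuracy suggests the gene carries little usable signal in this model, whatever its Gini value.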

That is not the best result, but it is still reasonable. Let’s tune the model a little and see whether it improves, changing only ntree to 10,000 and mtry to 6.

rf_cfs <- randomForest(training[1:20], training$class, mtry=6, ntree=10000, importance=TRUE)

rf_cfs$confusion
##                          Chronic Fatigue Syndrome healthy class.error
## Chronic Fatigue Syndrome                       40      34   0.4594595
## healthy                                        40      20   0.6666667

The change in tuning parameters seems to be just about the same.
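Rather than trying one alternative setting by hand, several `mtry` values can be compared on the out-of-bag error with the seed reset before each fit. The sketch below uses the built-in `iris` data as a stand-in so that it runs on its own; the grid of values and the 500 trees are arbitrary choices, and the block is skipped if the randomForest package is not installed.

```r
if (requireNamespace("randomForest", quietly = TRUE)) {
  x <- iris[, 1:4]   # stand-in predictors
  y <- iris$Species  # stand-in class labels
  mtry_grid <- c(1, 2, 4)

  # Fit one forest per mtry value and record the final out-of-bag error rate.
  oob <- sapply(mtry_grid, function(m) {
    set.seed(567)  # same seed for every fit so only mtry differs
    fit <- randomForest::randomForest(x, y, mtry = m, ntree = 500)
    fit$err.rate[fit$ntree, "OOB"]
  })
  names(oob) <- paste0("mtry_", mtry_grid)
  print(oob)
}
```

Applied to this analysis, `x` would be `training[1:20]` and `y` would be `training$class`.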

prediction_cfs <- predict(rf_cfs,testing)

results <- data.frame(predicted=prediction_cfs, actual=testing$class)

results
##                            predicted                   actual
## control_3   Chronic Fatigue Syndrome                  healthy
## case_4      Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## case_6      Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## case_16     Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## case_27     Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## case_28     Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## control_38                   healthy                  healthy
## control_47                   healthy                  healthy
## control_48  Chronic Fatigue Syndrome                  healthy
## control_54  Chronic Fatigue Syndrome                  healthy
## control_55                   healthy                  healthy
## case_65     Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## control_143                  healthy                  healthy
## control_147                  healthy                  healthy
## case_164    Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## case_166    Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## control_169 Chronic Fatigue Syndrome                  healthy
## case_170    Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## case_178    Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## control_181                  healthy                  healthy
## case_182    Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## control_184 Chronic Fatigue Syndrome                  healthy
## control_194 Chronic Fatigue Syndrome                  healthy
## control_199 Chronic Fatigue Syndrome                  healthy
## case_202                     healthy Chronic Fatigue Syndrome
## case_211    Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## control_214                  healthy                  healthy
## case_221    Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## case_224    Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## case_233    Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## case_241    Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## case_254    Chronic Fatigue Syndrome Chronic Fatigue Syndrome
## control_257 Chronic Fatigue Syndrome                  healthy
## case_270    Chronic Fatigue Syndrome Chronic Fatigue Syndrome
sum(results$predicted == results$actual)/length(results$predicted)
## [1] 0.7352941

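The prediction accuracy is the same. So, using fold change values, these genes are somewhat useful for predicting whether a sample is CFS or healthy, although the out-of-bag results on the training set were much weaker than the hold-out result, so the evidence is mixed. We will still add these genes to our pathology database at another time.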
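One caveat on the workflow: the 20 genes were chosen from fold changes computed on all 168 samples, including those later held out, so the test accuracy may be optimistic. A minimal sketch of the alternative order, splitting first and selecting genes from the training samples only, is shown below on simulated counts. Every object here is invented for illustration.

```r
set.seed(567)

# Simulated count matrix: 50 genes by 40 samples, with no real signal.
n_genes <- 50
n_samples <- 40
counts <- matrix(rpois(n_genes * n_samples, lambda = 20), nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)), NULL))
class_toy <- factor(rep(c("case", "control"), each = 20))

# Split first, so the held-out samples play no part in gene selection.
in_train <- sample(n_samples, 0.8 * n_samples)
train_counts <- counts[, in_train]
train_class <- class_toy[in_train]

# Fold change (control mean over case mean) from the training samples only.
fc_train <- rowMeans(train_counts[, train_class == "control"]) /
  rowMeans(train_counts[, train_class == "case"])

# Select the 5 highest and 5 lowest genes by training-set fold change.
ord <- order(fc_train, decreasing = TRUE)
selected <- rownames(counts)[c(head(ord, 5), tail(ord, 5))]
selected
```

The classifier would then be trained on `selected` using the training columns and evaluated on the untouched held-out columns.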