big28K dataframe LymeDisease

Trying out the randomForest package without the caret package to see how it would do on the lyme disease data that we used earlier that had an imbalanced number of classes. You can read about the randomForest package at this link.

library(randomForest)

## randomForest 4.7-1.2

## Type rfNews() to see new features/changes/bug fixes.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:randomForest':
## 
##     combine

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)

Make a big data of the lyme disease data to test out some functions in the randomForest package.

lyme <- read.csv('lymedisease_meanEqualsMedian99vars.csv', header=T, na.strings=c('',' ','na','NA'))
colnames(lyme)

##  [1] "Gene"                                
##  [2] "healthyControl_1"                    
##  [3] "healthyControl_2"                    
##  [4] "healthyControl_3"                    
##  [5] "healthyControl_4"                    
##  [6] "healthyControl_5"                    
##  [7] "healthyControl_6"                    
##  [8] "healthyControl_7"                    
##  [9] "healthyControl_8"                    
## [10] "healthyControl_9"                    
## [11] "healthyControl_10"                   
## [12] "healthyControl_11"                   
## [13] "healthyControl_12"                   
## [14] "healthyControl_13"                   
## [15] "healthyControl_14"                   
## [16] "healthyControl_15"                   
## [17] "healthyControl_16"                   
## [18] "healthyControl_17"                   
## [19] "healthyControl_18"                   
## [20] "healthyControl_19"                   
## [21] "healthyControl_20"                   
## [22] "healthyControl_21"                   
## [23] "acuteLymeDisease_1"                  
## [24] "acuteLymeDisease_2"                  
## [25] "acuteLymeDisease_3"                  
## [26] "acuteLymeDisease_4"                  
## [27] "acuteLymeDisease_5"                  
## [28] "acuteLymeDisease_6"                  
## [29] "acuteLymeDisease_7"                  
## [30] "acuteLymeDisease_8"                  
## [31] "acuteLymeDisease_9"                  
## [32] "acuteLymeDisease_10"                 
## [33] "acuteLymeDisease_11"                 
## [34] "acuteLymeDisease_12"                 
## [35] "acuteLymeDisease_13"                 
## [36] "acuteLymeDisease_14"                 
## [37] "acuteLymeDisease_15"                 
## [38] "acuteLymeDisease_16"                 
## [39] "acuteLymeDisease_17"                 
## [40] "acuteLymeDisease_18"                 
## [41] "acuteLymeDisease_19"                 
## [42] "acuteLymeDisease_20"                 
## [43] "acuteLymeDisease_21"                 
## [44] "acuteLymeDisease_22"                 
## [45] "acuteLymeDisease_23"                 
## [46] "acuteLymeDisease_24"                 
## [47] "acuteLymeDisease_25"                 
## [48] "acuteLymeDisease_26"                 
## [49] "acuteLymeDisease_27"                 
## [50] "acuteLymeDisease_28"                 
## [51] "Antibodies_1month_1"                 
## [52] "Antibodies_1month_2"                 
## [53] "Antibodies_1month_3"                 
## [54] "Antibodies_1month_4"                 
## [55] "Antibodies_1month_5"                 
## [56] "Antibodies_1month_6"                 
## [57] "Antibodies_1month_7"                 
## [58] "Antibodies_1month_8"                 
## [59] "Antibodies_1month_9"                 
## [60] "Antibodies_1month_10"                
## [61] "Antibodies_1month_11"                
## [62] "Antibodies_1month_12"                
## [63] "Antibodies_1month_13"                
## [64] "Antibodies_1month_14"                
## [65] "Antibodies_1month_15"                
## [66] "Antibodies_1month_16"                
## [67] "Antibodies_1month_17"                
## [68] "Antibodies_1month_18"                
## [69] "Antibodies_1month_19"                
## [70] "Antibodies_1month_20"                
## [71] "Antibodies_1month_21"                
## [72] "Antibodies_1month_22"                
## [73] "Antibodies_1month_23"                
## [74] "Antibodies_1month_24"                
## [75] "Antibodies_1month_25"                
## [76] "Antibodies_1month_26"                
## [77] "Antibodies_1month_27"                
## [78] "Antibodies_6months_1"                
## [79] "Antibodies_6months_2"                
## [80] "Antibodies_6months_3"                
## [81] "Antibodies_6months_4"                
## [82] "Antibodies_6months_5"                
## [83] "Antibodies_6months_6"                
## [84] "Antibodies_6months_7"                
## [85] "Antibodies_6months_8"                
## [86] "Antibodies_6months_9"                
## [87] "Antibodies_6months_10"               
## [88] "healthy_Mean"                        
## [89] "acuteLymeDisease_Mean"               
## [90] "antibodies_1month_Mean"              
## [91] "antibodies_6month_Mean"              
## [92] "acuteHealthy_foldChange"             
## [93] "antibodies_1month_healthy_foldChange"
## [94] "antibodies_6month_healthy_foldchange"
## [95] "acute_vs_6month_foldchange"          
## [96] "healthy_Median"                      
## [97] "acuteLymeDisease_Median"             
## [98] "antibodies_1month_Median"            
## [99] "antibodies_6month_Median"

Locate the file we just used here.

lymeClassOnly <- lyme[,1:87]
colnames(lymeClassOnly)

##  [1] "Gene"                  "healthyControl_1"      "healthyControl_2"     
##  [4] "healthyControl_3"      "healthyControl_4"      "healthyControl_5"     
##  [7] "healthyControl_6"      "healthyControl_7"      "healthyControl_8"     
## [10] "healthyControl_9"      "healthyControl_10"     "healthyControl_11"    
## [13] "healthyControl_12"     "healthyControl_13"     "healthyControl_14"    
## [16] "healthyControl_15"     "healthyControl_16"     "healthyControl_17"    
## [19] "healthyControl_18"     "healthyControl_19"     "healthyControl_20"    
## [22] "healthyControl_21"     "acuteLymeDisease_1"    "acuteLymeDisease_2"   
## [25] "acuteLymeDisease_3"    "acuteLymeDisease_4"    "acuteLymeDisease_5"   
## [28] "acuteLymeDisease_6"    "acuteLymeDisease_7"    "acuteLymeDisease_8"   
## [31] "acuteLymeDisease_9"    "acuteLymeDisease_10"   "acuteLymeDisease_11"  
## [34] "acuteLymeDisease_12"   "acuteLymeDisease_13"   "acuteLymeDisease_14"  
## [37] "acuteLymeDisease_15"   "acuteLymeDisease_16"   "acuteLymeDisease_17"  
## [40] "acuteLymeDisease_18"   "acuteLymeDisease_19"   "acuteLymeDisease_20"  
## [43] "acuteLymeDisease_21"   "acuteLymeDisease_22"   "acuteLymeDisease_23"  
## [46] "acuteLymeDisease_24"   "acuteLymeDisease_25"   "acuteLymeDisease_26"  
## [49] "acuteLymeDisease_27"   "acuteLymeDisease_28"   "Antibodies_1month_1"  
## [52] "Antibodies_1month_2"   "Antibodies_1month_3"   "Antibodies_1month_4"  
## [55] "Antibodies_1month_5"   "Antibodies_1month_6"   "Antibodies_1month_7"  
## [58] "Antibodies_1month_8"   "Antibodies_1month_9"   "Antibodies_1month_10" 
## [61] "Antibodies_1month_11"  "Antibodies_1month_12"  "Antibodies_1month_13" 
## [64] "Antibodies_1month_14"  "Antibodies_1month_15"  "Antibodies_1month_16" 
## [67] "Antibodies_1month_17"  "Antibodies_1month_18"  "Antibodies_1month_19" 
## [70] "Antibodies_1month_20"  "Antibodies_1month_21"  "Antibodies_1month_22" 
## [73] "Antibodies_1month_23"  "Antibodies_1month_24"  "Antibodies_1month_25" 
## [76] "Antibodies_1month_26"  "Antibodies_1month_27"  "Antibodies_6months_1" 
## [79] "Antibodies_6months_2"  "Antibodies_6months_3"  "Antibodies_6months_4" 
## [82] "Antibodies_6months_5"  "Antibodies_6months_6"  "Antibodies_6months_7" 
## [85] "Antibodies_6months_8"  "Antibodies_6months_9"  "Antibodies_6months_10"

classNames <- colnames(lymeClassOnly)

classNames

##  [1] "Gene"                  "healthyControl_1"      "healthyControl_2"     
##  [4] "healthyControl_3"      "healthyControl_4"      "healthyControl_5"     
##  [7] "healthyControl_6"      "healthyControl_7"      "healthyControl_8"     
## [10] "healthyControl_9"      "healthyControl_10"     "healthyControl_11"    
## [13] "healthyControl_12"     "healthyControl_13"     "healthyControl_14"    
## [16] "healthyControl_15"     "healthyControl_16"     "healthyControl_17"    
## [19] "healthyControl_18"     "healthyControl_19"     "healthyControl_20"    
## [22] "healthyControl_21"     "acuteLymeDisease_1"    "acuteLymeDisease_2"   
## [25] "acuteLymeDisease_3"    "acuteLymeDisease_4"    "acuteLymeDisease_5"   
## [28] "acuteLymeDisease_6"    "acuteLymeDisease_7"    "acuteLymeDisease_8"   
## [31] "acuteLymeDisease_9"    "acuteLymeDisease_10"   "acuteLymeDisease_11"  
## [34] "acuteLymeDisease_12"   "acuteLymeDisease_13"   "acuteLymeDisease_14"  
## [37] "acuteLymeDisease_15"   "acuteLymeDisease_16"   "acuteLymeDisease_17"  
## [40] "acuteLymeDisease_18"   "acuteLymeDisease_19"   "acuteLymeDisease_20"  
## [43] "acuteLymeDisease_21"   "acuteLymeDisease_22"   "acuteLymeDisease_23"  
## [46] "acuteLymeDisease_24"   "acuteLymeDisease_25"   "acuteLymeDisease_26"  
## [49] "acuteLymeDisease_27"   "acuteLymeDisease_28"   "Antibodies_1month_1"  
## [52] "Antibodies_1month_2"   "Antibodies_1month_3"   "Antibodies_1month_4"  
## [55] "Antibodies_1month_5"   "Antibodies_1month_6"   "Antibodies_1month_7"  
## [58] "Antibodies_1month_8"   "Antibodies_1month_9"   "Antibodies_1month_10" 
## [61] "Antibodies_1month_11"  "Antibodies_1month_12"  "Antibodies_1month_13" 
## [64] "Antibodies_1month_14"  "Antibodies_1month_15"  "Antibodies_1month_16" 
## [67] "Antibodies_1month_17"  "Antibodies_1month_18"  "Antibodies_1month_19" 
## [70] "Antibodies_1month_20"  "Antibodies_1month_21"  "Antibodies_1month_22" 
## [73] "Antibodies_1month_23"  "Antibodies_1month_24"  "Antibodies_1month_25" 
## [76] "Antibodies_1month_26"  "Antibodies_1month_27"  "Antibodies_6months_1" 
## [79] "Antibodies_6months_2"  "Antibodies_6months_3"  "Antibodies_6months_4" 
## [82] "Antibodies_6months_5"  "Antibodies_6months_6"  "Antibodies_6months_7" 
## [85] "Antibodies_6months_8"  "Antibodies_6months_9"  "Antibodies_6months_10"

healthyIndex <- grep('healthy',classNames)
acute <- grep('acute', classNames)
month1 <- grep('1month',classNames)
month6 <- grep('6month', classNames)

classNames[healthyIndex] <- 'healthy'
classNames[acute] <- 'acute'
classNames[month1] <- '1 month'
classNames[month6] <- '6 month'
classNames

##  [1] "Gene"    "healthy" "healthy" "healthy" "healthy" "healthy" "healthy"
##  [8] "healthy" "healthy" "healthy" "healthy" "healthy" "healthy" "healthy"
## [15] "healthy" "healthy" "healthy" "healthy" "healthy" "healthy" "healthy"
## [22] "healthy" "acute"   "acute"   "acute"   "acute"   "acute"   "acute"  
## [29] "acute"   "acute"   "acute"   "acute"   "acute"   "acute"   "acute"  
## [36] "acute"   "acute"   "acute"   "acute"   "acute"   "acute"   "acute"  
## [43] "acute"   "acute"   "acute"   "acute"   "acute"   "acute"   "acute"  
## [50] "acute"   "1 month" "1 month" "1 month" "1 month" "1 month" "1 month"
## [57] "1 month" "1 month" "1 month" "1 month" "1 month" "1 month" "1 month"
## [64] "1 month" "1 month" "1 month" "1 month" "1 month" "1 month" "1 month"
## [71] "1 month" "1 month" "1 month" "1 month" "1 month" "1 month" "1 month"
## [78] "6 month" "6 month" "6 month" "6 month" "6 month" "6 month" "6 month"
## [85] "6 month" "6 month" "6 month"

lymeByClass <- lymeClassOnly
colnames(lymeByClass) <- classNames
colnames(lymeByClass)

##  [1] "Gene"    "healthy" "healthy" "healthy" "healthy" "healthy" "healthy"
##  [8] "healthy" "healthy" "healthy" "healthy" "healthy" "healthy" "healthy"
## [15] "healthy" "healthy" "healthy" "healthy" "healthy" "healthy" "healthy"
## [22] "healthy" "acute"   "acute"   "acute"   "acute"   "acute"   "acute"  
## [29] "acute"   "acute"   "acute"   "acute"   "acute"   "acute"   "acute"  
## [36] "acute"   "acute"   "acute"   "acute"   "acute"   "acute"   "acute"  
## [43] "acute"   "acute"   "acute"   "acute"   "acute"   "acute"   "acute"  
## [50] "acute"   "1 month" "1 month" "1 month" "1 month" "1 month" "1 month"
## [57] "1 month" "1 month" "1 month" "1 month" "1 month" "1 month" "1 month"
## [64] "1 month" "1 month" "1 month" "1 month" "1 month" "1 month" "1 month"
## [71] "1 month" "1 month" "1 month" "1 month" "1 month" "1 month" "1 month"
## [78] "6 month" "6 month" "6 month" "6 month" "6 month" "6 month" "6 month"
## [85] "6 month" "6 month" "6 month"

lymeByClass1 <- lymeByClass[complete.cases(lymeByClass),]
dim(lymeByClass1)

## [1] 19525    87

Only one case wasn’t complete so it was removed from the genes.

Make all genes as features as a very wide or fat dataframe that can really slow or shut down some processes.

big20kDF <- data.frame(t(lymeByClass))
#head(big20kDF) #this printed out a very long document had to remove it
big20kDF[1:5,1:5]

##                      X1            X2            X3            X4            X5
## Gene               A1BG          A1CF         A2BP1         A2LD1           A2M
## healthy   -1.517870e-01  1.005733e-02  1.748806e-01 -3.287320e-01 -4.190273e-01
## healthy.1 -0.1186983600 -0.4603165350  0.1099771518 -0.2406292000  0.4414801600
## healthy.2  0.0801501300  0.4677242000  0.5472002982 -0.2155766500  1.1160240000
## healthy.3 -3.132427e-01  1.544443e-01 -1.674776e-01 -3.081846e-01  1.927662e-01

colnames(big20kDF) <-lymeByClass$Gene
Big20K_df <- big20kDF[-1,]
#head(Big20K_df)
Big20K_df[1:5,1:5]

##                    A1BG          A1CF         A2BP1         A2LD1           A2M
## healthy   -1.517870e-01  1.005733e-02  1.748806e-01 -3.287320e-01 -4.190273e-01
## healthy.1 -0.1186983600 -0.4603165350  0.1099771518 -0.2406292000  0.4414801600
## healthy.2  0.0801501300  0.4677242000  0.5472002982 -0.2155766500  1.1160240000
## healthy.3 -3.132427e-01  1.544443e-01 -1.674776e-01 -3.081846e-01  1.927662e-01
## healthy.4 -4.757977e-02 -6.056571e-02 -1.134319e-01 -1.514206e-01  1.730504e-01

dim(Big20K_df)

## [1]    86 19526

Big20K_df$class <- classNames[2:87]
Big20K_df[1:5,1:5]

##                    A1BG          A1CF         A2BP1         A2LD1           A2M
## healthy   -1.517870e-01  1.005733e-02  1.748806e-01 -3.287320e-01 -4.190273e-01
## healthy.1 -0.1186983600 -0.4603165350  0.1099771518 -0.2406292000  0.4414801600
## healthy.2  0.0801501300  0.4677242000  0.5472002982 -0.2155766500  1.1160240000
## healthy.3 -3.132427e-01  1.544443e-01 -1.674776e-01 -3.081846e-01  1.927662e-01
## healthy.4 -4.757977e-02 -6.056571e-02 -1.134319e-01 -1.514206e-01  1.730504e-01

write.csv(Big20K_df,'fat20k_Lyme_DF.csv', row.names=F)

Google sheets said file is too large to upload. Was able to write it out but it failed on a ‘function’ mode when reading in.

We will just try one run of the randomForest functions to see if it can pull up most important features with this very large data frame.

acute6monthFC <- lyme[order(lyme$acute_vs_6month_foldchange, decreasing=T),]
colnames(acute6monthFC)

##  [1] "Gene"                                
##  [2] "healthyControl_1"                    
##  [3] "healthyControl_2"                    
##  [4] "healthyControl_3"                    
##  [5] "healthyControl_4"                    
##  [6] "healthyControl_5"                    
##  [7] "healthyControl_6"                    
##  [8] "healthyControl_7"                    
##  [9] "healthyControl_8"                    
## [10] "healthyControl_9"                    
## [11] "healthyControl_10"                   
## [12] "healthyControl_11"                   
## [13] "healthyControl_12"                   
## [14] "healthyControl_13"                   
## [15] "healthyControl_14"                   
## [16] "healthyControl_15"                   
## [17] "healthyControl_16"                   
## [18] "healthyControl_17"                   
## [19] "healthyControl_18"                   
## [20] "healthyControl_19"                   
## [21] "healthyControl_20"                   
## [22] "healthyControl_21"                   
## [23] "acuteLymeDisease_1"                  
## [24] "acuteLymeDisease_2"                  
## [25] "acuteLymeDisease_3"                  
## [26] "acuteLymeDisease_4"                  
## [27] "acuteLymeDisease_5"                  
## [28] "acuteLymeDisease_6"                  
## [29] "acuteLymeDisease_7"                  
## [30] "acuteLymeDisease_8"                  
## [31] "acuteLymeDisease_9"                  
## [32] "acuteLymeDisease_10"                 
## [33] "acuteLymeDisease_11"                 
## [34] "acuteLymeDisease_12"                 
## [35] "acuteLymeDisease_13"                 
## [36] "acuteLymeDisease_14"                 
## [37] "acuteLymeDisease_15"                 
## [38] "acuteLymeDisease_16"                 
## [39] "acuteLymeDisease_17"                 
## [40] "acuteLymeDisease_18"                 
## [41] "acuteLymeDisease_19"                 
## [42] "acuteLymeDisease_20"                 
## [43] "acuteLymeDisease_21"                 
## [44] "acuteLymeDisease_22"                 
## [45] "acuteLymeDisease_23"                 
## [46] "acuteLymeDisease_24"                 
## [47] "acuteLymeDisease_25"                 
## [48] "acuteLymeDisease_26"                 
## [49] "acuteLymeDisease_27"                 
## [50] "acuteLymeDisease_28"                 
## [51] "Antibodies_1month_1"                 
## [52] "Antibodies_1month_2"                 
## [53] "Antibodies_1month_3"                 
## [54] "Antibodies_1month_4"                 
## [55] "Antibodies_1month_5"                 
## [56] "Antibodies_1month_6"                 
## [57] "Antibodies_1month_7"                 
## [58] "Antibodies_1month_8"                 
## [59] "Antibodies_1month_9"                 
## [60] "Antibodies_1month_10"                
## [61] "Antibodies_1month_11"                
## [62] "Antibodies_1month_12"                
## [63] "Antibodies_1month_13"                
## [64] "Antibodies_1month_14"                
## [65] "Antibodies_1month_15"                
## [66] "Antibodies_1month_16"                
## [67] "Antibodies_1month_17"                
## [68] "Antibodies_1month_18"                
## [69] "Antibodies_1month_19"                
## [70] "Antibodies_1month_20"                
## [71] "Antibodies_1month_21"                
## [72] "Antibodies_1month_22"                
## [73] "Antibodies_1month_23"                
## [74] "Antibodies_1month_24"                
## [75] "Antibodies_1month_25"                
## [76] "Antibodies_1month_26"                
## [77] "Antibodies_1month_27"                
## [78] "Antibodies_6months_1"                
## [79] "Antibodies_6months_2"                
## [80] "Antibodies_6months_3"                
## [81] "Antibodies_6months_4"                
## [82] "Antibodies_6months_5"                
## [83] "Antibodies_6months_6"                
## [84] "Antibodies_6months_7"                
## [85] "Antibodies_6months_8"                
## [86] "Antibodies_6months_9"                
## [87] "Antibodies_6months_10"               
## [88] "healthy_Mean"                        
## [89] "acuteLymeDisease_Mean"               
## [90] "antibodies_1month_Mean"              
## [91] "antibodies_6month_Mean"              
## [92] "acuteHealthy_foldChange"             
## [93] "antibodies_1month_healthy_foldChange"
## [94] "antibodies_6month_healthy_foldchange"
## [95] "acute_vs_6month_foldchange"          
## [96] "healthy_Median"                      
## [97] "acuteLymeDisease_Median"             
## [98] "antibodies_1month_Median"            
## [99] "antibodies_6month_Median"

Take top40 and bottom 40 genes arrange from fold change values of the acute samples vs the samples already infected 6 months.

acuteVs6monthTop40 <- acute6monthFC[c(1:40,19487:19526),c(1:87,95)]
head(acuteVs6monthTop40$acute_vs_6month_foldchange,40)

##  [1] 39592.5065 26949.4605  3013.6402  2340.4274  2252.7837  1667.6541
##  [7]  1667.5040  1390.4131  1087.7278   959.0077   955.0899   819.7076
## [13]   814.2625   784.6073   757.5879   687.6496   660.1728   636.0464
## [19]   592.8499   579.0681   568.3394   556.5235   507.4635   502.1060
## [25]   415.4412   401.5336   385.7999   380.0250   376.6654   320.0938
## [31]   317.9930   313.9046   245.4455   214.5356   203.8399   200.7999
## [37]   192.8368   184.9654   180.9991   175.9187

tail(acuteVs6monthTop40$acute_vs_6month_foldchange,40)

##  [1]  -138.4930  -147.4067  -147.4130  -148.3167  -157.0114  -157.8041
##  [7]  -159.4937  -165.7433  -172.9144  -178.5105  -188.8205  -197.4117
## [13]  -197.8417  -219.1352  -220.8985  -221.6484  -225.3240  -235.9024
## [19]  -258.8922  -260.3979  -268.6351  -312.2450  -377.6045  -423.9474
## [25]  -449.6177  -496.0992  -543.2492  -583.6196  -589.4268  -732.5246
## [31]  -818.2834 -1111.6293 -1166.7017 -1202.1031 -1208.8211 -1363.4545
## [37] -1473.4662 -3152.9187 -4837.2160 -5134.0986

acute6monthDF <- acuteVs6monthTop40[,-88]
genes80 <- acute6monthDF$Gene
genes80

##  [1] "NEK6"            "HECTD1"          "PDE4DIP"         "POU2F2"         
##  [5] "TMX3"            "KAT5"            "PLCXD1"          "MPDU1"          
##  [9] "PCSK1N"          "TCEAL4"          "KRT36"           "HSD11B1"        
## [13] "LOC100287314"    "CHGA"            "STK38L"          "BRD3"           
## [17] "TMSB4Y"          "GGT7"            "RNF217"          "ALAS2"          
## [21] "LILRA6"          "ADRBK2"          "TAS2R3"          "C1orf51"        
## [25] "CFTR"            "HS3ST3B1"        "LRRCC1"          "MPO"            
## [29] "LOC653562"       "TPBG"            "SCN11A"          "VSIG8"          
## [33] "CD2AP"           "FLJ23867"        "BEGAIN"          "MGC34800"       
## [37] "P2RX7"           "WDR46"           "GSC"             "C3orf42"        
## [41] "MTMR14"          "FBXO48"          "H2AFJ"           "KRTAP10-7"      
## [45] "SNORA74A"        "RASAL1"          "ENSG00000229196" "FBXO9"          
## [49] "SOX10"           "C1orf127"        "F13A1"           "OR10A3"         
## [53] "LOC145757"       "FAT2"            "OOEP"            "EIF2C2"         
## [57] "MMP13"           "MAP2K3"          "BRSK1"           "RHOBTB2"        
## [61] "Septin 6"        "STRA8"           "ATP2C1"          "OR14A16"        
## [65] "AHSG"            "ORC6L"           "TLX1NB"          "MTHFD2"         
## [69] "GMPPB"           "RGPD3"           "FXR1"            "GJA5"           
## [73] "ZNF836"          "C16orf48"        "DNAJC24"         "TSHR"           
## [77] "MTA3"            "NDST3"           "OR12D2"          "MKL1"

DF80 <- data.frame(t(acute6monthDF))
colnames(DF80) <- genes80
DF80$class <- classNames
DF80_ML <- DF80[-1,]

write.csv(DF80_ML, 'ML_lyme_DF_86X81.csv', row.names=F)

You can access this file here.

clear environment of all variables.

rm(list=ls())

Read in machine learning data frame of 80 genes as features to predict class of 86 classes. This will prevent the rows from having names.

ML80 <- read.csv('ML_lyme_DF_86X81.csv', header=T, na.strings=c('',' ','na','NA'))

Lets use the cross validation function in randomForest to get important features.

str(ML80)

## 'data.frame':    86 obs. of  81 variables:
##  $ NEK6           : num  -0.3508 -0.1491 -0.0436 0.5418 -0.4119 ...
##  $ HECTD1         : num  -0.302 0.521 -1.802 0.547 0.531 ...
##  $ PDE4DIP        : num  -0.5338 -0.111 0.04 -0.0543 0.0395 ...
##  $ POU2F2         : num  -0.305 -0.258 -0.0195 -0.0769 -0.1503 ...
##  $ TMX3           : num  -0.0214 0.2596 -0.61 0.234 0.4109 ...
##  $ KAT5           : num  0.2 0.434 0.377 0.37 -0.159 ...
##  $ PLCXD1         : num  -0.5578 -0.5126 -0.334 0.0917 -0.4348 ...
##  $ MPDU1          : num  -0.31 -0.444 -0.122 -0.09 -0.354 ...
##  $ PCSK1N         : num  -0.0774 -0.0476 0.7991 -0.0582 -0.3024 ...
##  $ TCEAL4         : num  0.5637 -0.318 -1.2768 0.282 0.0424 ...
##  $ KRT36          : num  -0.05664 -0.199899 0.185492 0.000144 -0.155894 ...
##  $ HSD11B1        : num  -0.0724 0.0489 0.1243 -0.0985 -0.1971 ...
##  $ LOC100287314   : num  -0.1245 0.1503 -0.1013 0.0786 -0.1708 ...
##  $ CHGA           : num  0.0631 -0.2733 1.2062 -0.2496 -0.2776 ...
##  $ STK38L         : num  -0.7317 0.0727 -0.3871 0.3623 0.0709 ...
##  $ BRD3           : num  0.0203 0.2553 0.126 0.313 0.4716 ...
##  $ TMSB4Y         : num  0.3061 -0.5535 0.0623 -0.2572 0.2259 ...
##  $ GGT7           : num  -0.156 0.111 -0.0523 -0.4699 0.2977 ...
##  $ RNF217         : num  -0.315 1.019 -0.152 -0.205 0.305 ...
##  $ ALAS2          : num  0.14 2.008 -0.517 -0.554 -0.762 ...
##  $ LILRA6         : num  -0.571 -0.804 -1.225 -0.696 -0.374 ...
##  $ ADRBK2         : num  -0.7166 -0.0239 -0.5431 0.0757 -0.2658 ...
##  $ TAS2R3         : num  0.2325 0.1031 -0.036 -0.0213 -0.2399 ...
##  $ C1orf51        : num  0.1973 -0.1743 0.2388 0.1867 -0.0457 ...
##  $ CFTR           : num  0.25 -0.272 0.751 -0.632 -0.32 ...
##  $ HS3ST3B1       : num  -0.336 0.323 0.194 0.392 0.282 ...
##  $ LRRCC1         : num  -0.0765 -0.0728 -0.0306 0.1316 0.0904 ...
##  $ MPO            : num  -0.0461 -0.3983 -0.0862 -0.0156 -0.249 ...
##  $ LOC653562      : num  -0.158 1.295 0.837 -0.192 -0.407 ...
##  $ TPBG           : num  -0.287 0.947 0.276 -0.4 -0.591 ...
##  $ SCN11A         : num  -0.0107 0.0328 1.0342 -0.0841 -0.1941 ...
##  $ VSIG8          : num  0.627 -0.198 -0.332 0.423 -0.334 ...
##  $ CD2AP          : num  -0.0392 -0.1189 0.0636 0.1281 0.119 ...
##  $ FLJ23867       : num  -0.4395 0.4936 -0.0292 -0.3027 0.0265 ...
##  $ BEGAIN         : num  -0.3016 -0.1186 1.0694 -0.0139 -0.2875 ...
##  $ MGC34800       : num  -0.0912 -0.2683 -0.0734 -0.019 -0.2323 ...
##  $ P2RX7          : num  -0.699 0.453 -0.534 -0.214 0.141 ...
##  $ WDR46          : num  0.0299 -0.1665 -0.2608 -0.1359 -0.1444 ...
##  $ GSC            : num  0.3248 -0.1677 0.407 0.0255 0.3996 ...
##  $ C3orf42        : num  0.0669 -0.1948 0.4071 0.0616 -0.4221 ...
##  $ MTMR14         : num  -0.0949 -0.0122 -0.1157 0.1285 -0.0997 ...
##  $ FBXO48         : num  0.173 -0.098 -0.393 0.239 -0.116 ...
##  $ H2AFJ          : num  -0.633 -0.0892 -0.715 0.1485 0.3033 ...
##  $ KRTAP10.7      : num  0.734 0.257 1.111 -0.148 -0.135 ...
##  $ SNORA74A       : num  0.676 0.139 0.872 0.114 -0.38 ...
##  $ RASAL1         : num  0.11742 -0.00149 -0.02482 -0.23536 0.028 ...
##  $ ENSG00000229196: num  0.75822 -0.00579 0.93324 -0.02652 0.18702 ...
##  $ FBXO9          : num  0.0192 -0.0072 -0.6633 -0.0984 0.025 ...
##  $ SOX10          : num  0.0862 -0.1704 -0.1274 -0.0798 -0.3639 ...
##  $ C1orf127       : num  0.246 0.336 1.465 -0.343 -0.128 ...
##  $ F13A1          : num  -0.605 0.36 -0.218 -0.598 -0.223 ...
##  $ OR10A3         : num  -0.06915 0.05162 0.24496 0.00137 -0.28854 ...
##  $ LOC145757      : num  -0.0569 -0.04769 0.00873 -0.09158 -0.1248 ...
##  $ FAT2           : num  0.1919 -0.3219 0.3047 -0.0593 -0.1934 ...
##  $ OOEP           : num  0.177491 -0.000566 0.668561 0.199253 -0.173194 ...
##  $ EIF2C2         : num  0.4964 0.2373 -0.7133 0.3628 -0.0516 ...
##  $ MMP13          : num  0.0434 -0.0802 0.2907 -0.0312 -0.1319 ...
##  $ MAP2K3         : num  -0.7315 0.0363 0.1161 -0.0728 -0.1372 ...
##  $ BRSK1          : num  -0.604 0.0353 0.3551 -0.164 -0.3456 ...
##  $ RHOBTB2        : num  0.1453 0.1134 0.6744 0.1848 0.0606 ...
##  $ Septin.6       : num  -0.1437 -0.3411 -0.0162 -0.4229 -0.0351 ...
##  $ STRA8          : num  0.8642 0.0254 0.3683 0.3999 -0.0766 ...
##  $ ATP2C1         : num  -0.361 0.125 -0.659 0.119 0.149 ...
##  $ OR14A16        : num  0.0286 -0.0641 0.4736 0.0261 -0.3375 ...
##  $ AHSG           : num  0.192 -0.22 0.92 -0.15 -0.277 ...
##  $ ORC6L          : num  0.19 -1.369 -0.131 -0.742 0.143 ...
##  $ TLX1NB         : num  0.502 -0.256 0.807 -0.475 -0.145 ...
##  $ MTHFD2         : num  0.35248 0.31067 -1.05314 -0.01933 0.00216 ...
##  $ GMPPB          : num  0.0119 0.3596 0.5872 0.3397 0.0886 ...
##  $ RGPD3          : num  -0.000246 0.262285 -0.040798 -0.375937 0.190607 ...
##  $ FXR1           : num  0.012 0.4993 -0.3251 0.0827 0.3152 ...
##  $ GJA5           : num  0.3345 0.1924 0.4793 -0.0485 -0.1555 ...
##  $ ZNF836         : num  0.0683 -0.0257 0.1888 -0.057 -0.1004 ...
##  $ C16orf48       : num  0.2156 -0.0478 0.4009 0.3981 -0.21 ...
##  $ DNAJC24        : num  -0.4986 0.0249 -0.5918 0.3067 -0.0755 ...
##  $ TSHR           : num  -0.056 0.1009 0.3179 0.0243 -0.0942 ...
##  $ MTA3           : num  0.239 -0.491 -0.483 0.221 0.871 ...
##  $ NDST3          : num  -0.0178 0.0633 0.1542 -0.064 -0.0408 ...
##  $ OR12D2         : num  0.2254 -0.1449 0.066 0.135 -0.0408 ...
##  $ MKL1           : num  -0.33057 0.12237 0.00884 0.07873 -0.08919 ...
##  $ class          : chr  "healthy" "healthy" "healthy" "healthy" ...

ML80$class <- as.factor(ML80$class)
str(ML80)

## 'data.frame':    86 obs. of  81 variables:
##  $ NEK6           : num  -0.3508 -0.1491 -0.0436 0.5418 -0.4119 ...
##  $ HECTD1         : num  -0.302 0.521 -1.802 0.547 0.531 ...
##  $ PDE4DIP        : num  -0.5338 -0.111 0.04 -0.0543 0.0395 ...
##  $ POU2F2         : num  -0.305 -0.258 -0.0195 -0.0769 -0.1503 ...
##  $ TMX3           : num  -0.0214 0.2596 -0.61 0.234 0.4109 ...
##  $ KAT5           : num  0.2 0.434 0.377 0.37 -0.159 ...
##  $ PLCXD1         : num  -0.5578 -0.5126 -0.334 0.0917 -0.4348 ...
##  $ MPDU1          : num  -0.31 -0.444 -0.122 -0.09 -0.354 ...
##  $ PCSK1N         : num  -0.0774 -0.0476 0.7991 -0.0582 -0.3024 ...
##  $ TCEAL4         : num  0.5637 -0.318 -1.2768 0.282 0.0424 ...
##  $ KRT36          : num  -0.05664 -0.199899 0.185492 0.000144 -0.155894 ...
##  $ HSD11B1        : num  -0.0724 0.0489 0.1243 -0.0985 -0.1971 ...
##  $ LOC100287314   : num  -0.1245 0.1503 -0.1013 0.0786 -0.1708 ...
##  $ CHGA           : num  0.0631 -0.2733 1.2062 -0.2496 -0.2776 ...
##  $ STK38L         : num  -0.7317 0.0727 -0.3871 0.3623 0.0709 ...
##  $ BRD3           : num  0.0203 0.2553 0.126 0.313 0.4716 ...
##  $ TMSB4Y         : num  0.3061 -0.5535 0.0623 -0.2572 0.2259 ...
##  $ GGT7           : num  -0.156 0.111 -0.0523 -0.4699 0.2977 ...
##  $ RNF217         : num  -0.315 1.019 -0.152 -0.205 0.305 ...
##  $ ALAS2          : num  0.14 2.008 -0.517 -0.554 -0.762 ...
##  $ LILRA6         : num  -0.571 -0.804 -1.225 -0.696 -0.374 ...
##  $ ADRBK2         : num  -0.7166 -0.0239 -0.5431 0.0757 -0.2658 ...
##  $ TAS2R3         : num  0.2325 0.1031 -0.036 -0.0213 -0.2399 ...
##  $ C1orf51        : num  0.1973 -0.1743 0.2388 0.1867 -0.0457 ...
##  $ CFTR           : num  0.25 -0.272 0.751 -0.632 -0.32 ...
##  $ HS3ST3B1       : num  -0.336 0.323 0.194 0.392 0.282 ...
##  $ LRRCC1         : num  -0.0765 -0.0728 -0.0306 0.1316 0.0904 ...
##  $ MPO            : num  -0.0461 -0.3983 -0.0862 -0.0156 -0.249 ...
##  $ LOC653562      : num  -0.158 1.295 0.837 -0.192 -0.407 ...
##  $ TPBG           : num  -0.287 0.947 0.276 -0.4 -0.591 ...
##  $ SCN11A         : num  -0.0107 0.0328 1.0342 -0.0841 -0.1941 ...
##  $ VSIG8          : num  0.627 -0.198 -0.332 0.423 -0.334 ...
##  $ CD2AP          : num  -0.0392 -0.1189 0.0636 0.1281 0.119 ...
##  $ FLJ23867       : num  -0.4395 0.4936 -0.0292 -0.3027 0.0265 ...
##  $ BEGAIN         : num  -0.3016 -0.1186 1.0694 -0.0139 -0.2875 ...
##  $ MGC34800       : num  -0.0912 -0.2683 -0.0734 -0.019 -0.2323 ...
##  $ P2RX7          : num  -0.699 0.453 -0.534 -0.214 0.141 ...
##  $ WDR46          : num  0.0299 -0.1665 -0.2608 -0.1359 -0.1444 ...
##  $ GSC            : num  0.3248 -0.1677 0.407 0.0255 0.3996 ...
##  $ C3orf42        : num  0.0669 -0.1948 0.4071 0.0616 -0.4221 ...
##  $ MTMR14         : num  -0.0949 -0.0122 -0.1157 0.1285 -0.0997 ...
##  $ FBXO48         : num  0.173 -0.098 -0.393 0.239 -0.116 ...
##  $ H2AFJ          : num  -0.633 -0.0892 -0.715 0.1485 0.3033 ...
##  $ KRTAP10.7      : num  0.734 0.257 1.111 -0.148 -0.135 ...
##  $ SNORA74A       : num  0.676 0.139 0.872 0.114 -0.38 ...
##  $ RASAL1         : num  0.11742 -0.00149 -0.02482 -0.23536 0.028 ...
##  $ ENSG00000229196: num  0.75822 -0.00579 0.93324 -0.02652 0.18702 ...
##  $ FBXO9          : num  0.0192 -0.0072 -0.6633 -0.0984 0.025 ...
##  $ SOX10          : num  0.0862 -0.1704 -0.1274 -0.0798 -0.3639 ...
##  $ C1orf127       : num  0.246 0.336 1.465 -0.343 -0.128 ...
##  $ F13A1          : num  -0.605 0.36 -0.218 -0.598 -0.223 ...
##  $ OR10A3         : num  -0.06915 0.05162 0.24496 0.00137 -0.28854 ...
##  $ LOC145757      : num  -0.0569 -0.04769 0.00873 -0.09158 -0.1248 ...
##  $ FAT2           : num  0.1919 -0.3219 0.3047 -0.0593 -0.1934 ...
##  $ OOEP           : num  0.177491 -0.000566 0.668561 0.199253 -0.173194 ...
##  $ EIF2C2         : num  0.4964 0.2373 -0.7133 0.3628 -0.0516 ...
##  $ MMP13          : num  0.0434 -0.0802 0.2907 -0.0312 -0.1319 ...
##  $ MAP2K3         : num  -0.7315 0.0363 0.1161 -0.0728 -0.1372 ...
##  $ BRSK1          : num  -0.604 0.0353 0.3551 -0.164 -0.3456 ...
##  $ RHOBTB2        : num  0.1453 0.1134 0.6744 0.1848 0.0606 ...
##  $ Septin.6       : num  -0.1437 -0.3411 -0.0162 -0.4229 -0.0351 ...
##  $ STRA8          : num  0.8642 0.0254 0.3683 0.3999 -0.0766 ...
##  $ ATP2C1         : num  -0.361 0.125 -0.659 0.119 0.149 ...
##  $ OR14A16        : num  0.0286 -0.0641 0.4736 0.0261 -0.3375 ...
##  $ AHSG           : num  0.192 -0.22 0.92 -0.15 -0.277 ...
##  $ ORC6L          : num  0.19 -1.369 -0.131 -0.742 0.143 ...
##  $ TLX1NB         : num  0.502 -0.256 0.807 -0.475 -0.145 ...
##  $ MTHFD2         : num  0.35248 0.31067 -1.05314 -0.01933 0.00216 ...
##  $ GMPPB          : num  0.0119 0.3596 0.5872 0.3397 0.0886 ...
##  $ RGPD3          : num  -0.000246 0.262285 -0.040798 -0.375937 0.190607 ...
##  $ FXR1           : num  0.012 0.4993 -0.3251 0.0827 0.3152 ...
##  $ GJA5           : num  0.3345 0.1924 0.4793 -0.0485 -0.1555 ...
##  $ ZNF836         : num  0.0683 -0.0257 0.1888 -0.057 -0.1004 ...
##  $ C16orf48       : num  0.2156 -0.0478 0.4009 0.3981 -0.21 ...
##  $ DNAJC24        : num  -0.4986 0.0249 -0.5918 0.3067 -0.0755 ...
##  $ TSHR           : num  -0.056 0.1009 0.3179 0.0243 -0.0942 ...
##  $ MTA3           : num  0.239 -0.491 -0.483 0.221 0.871 ...
##  $ NDST3          : num  -0.0178 0.0633 0.1542 -0.064 -0.0408 ...
##  $ OR12D2         : num  0.2254 -0.1449 0.066 0.135 -0.0408 ...
##  $ MKL1           : num  -0.33057 0.12237 0.00884 0.07873 -0.08919 ...
##  $ class          : Factor w/ 4 levels "1 month","6 month",..: 4 4 4 4 4 4 4 4 4 4 ...

set.seed(123)
intrain <- sample(1:86,.8*86)
training <- ML80[intrain,]
testing <- ML80[-intrain,]

result <- rfcv(training[,-81], training[,81], cv.fold=3)
print(result)

## $n.var
## [1] 80 40 20 10  5  2  1
## 
## $error.cv
##        80        40        20        10         5         2         1 
## 0.5294118 0.5000000 0.4852941 0.5294118 0.5882353 0.7058824 0.6470588 
## 
## $predicted
## $predicted$`80`
##  [1] acute   acute   acute   6 month 1 month acute   1 month 1 month healthy
## [10] 1 month acute   acute   healthy healthy acute   healthy 1 month acute  
## [19] acute   1 month 1 month acute   1 month acute   acute   acute   acute  
## [28] healthy acute   healthy healthy acute   acute   healthy acute   acute  
## [37] acute   acute   acute   acute   healthy healthy 1 month healthy 1 month
## [46] 1 month acute   healthy healthy healthy 1 month acute   healthy 1 month
## [55] healthy 1 month 1 month acute   acute   acute   acute   healthy 6 month
## [64] 1 month acute   healthy 1 month acute  
## Levels: 1 month 6 month acute healthy
## 
## $predicted$`40`
##  [1] acute   acute   acute   1 month 1 month acute   1 month acute   healthy
## [10] 1 month acute   acute   healthy healthy acute   healthy healthy acute  
## [19] acute   1 month healthy acute   1 month acute   1 month acute   acute  
## [28] healthy acute   healthy healthy acute   1 month healthy acute   acute  
## [37] healthy acute   acute   acute   healthy healthy 1 month healthy healthy
## [46] 1 month acute   healthy healthy healthy healthy acute   healthy 1 month
## [55] 1 month 6 month 1 month acute   acute   acute   acute   healthy 6 month
## [64] 1 month acute   healthy 1 month acute  
## Levels: 1 month 6 month acute healthy
## 
## $predicted$`20`
##  [1] acute   acute   acute   healthy 1 month acute   1 month acute   healthy
## [10] 1 month acute   acute   healthy 6 month acute   healthy healthy 1 month
## [19] acute   healthy healthy acute   1 month acute   healthy healthy acute  
## [28] healthy acute   healthy healthy acute   1 month healthy acute   acute  
## [37] 1 month acute   acute   acute   healthy healthy 1 month healthy healthy
## [46] 1 month acute   healthy healthy healthy healthy acute   healthy 1 month
## [55] 1 month acute   1 month acute   acute   acute   healthy healthy acute  
## [64] 1 month acute   healthy 1 month acute  
## Levels: 1 month 6 month acute healthy
## 
## $predicted$`10`
##  [1] acute   acute   1 month 6 month 1 month acute   1 month acute   healthy
## [10] 1 month acute   acute   healthy 6 month 1 month healthy healthy 1 month
## [19] acute   healthy healthy acute   1 month acute   healthy 6 month acute  
## [28] healthy acute   1 month healthy acute   acute   healthy acute   acute  
## [37] 1 month acute   acute   acute   healthy healthy acute   healthy healthy
## [46] 1 month acute   6 month healthy healthy healthy acute   healthy 1 month
## [55] 1 month acute   1 month acute   healthy acute   healthy healthy healthy
## [64] 1 month acute   healthy 1 month acute  
## Levels: 1 month 6 month acute healthy
## 
## $predicted$`5`
##  [1] acute   acute   1 month healthy 1 month acute   1 month 1 month healthy
## [10] 1 month acute   acute   healthy healthy acute   acute   healthy 1 month
## [19] acute   healthy healthy acute   1 month acute   healthy 6 month acute  
## [28] 1 month 1 month acute   healthy acute   1 month healthy acute   acute  
## [37] acute   1 month acute   acute   healthy healthy 6 month healthy 1 month
## [46] 1 month acute   healthy 1 month 6 month healthy acute   healthy 1 month
## [55] 1 month acute   1 month acute   6 month acute   healthy 6 month 6 month
## [64] 1 month acute   healthy 1 month acute  
## Levels: 1 month 6 month acute healthy
## 
## $predicted$`2`
##  [1] healthy acute   6 month healthy 1 month acute   1 month 6 month 1 month
## [10] acute   acute   acute   1 month healthy 1 month healthy 1 month 6 month
## [19] 6 month acute   1 month acute   healthy healthy healthy 6 month acute  
## [28] 1 month healthy healthy 1 month acute   healthy healthy acute   acute  
## [37] acute   acute   acute   acute   healthy 6 month healthy acute   6 month
## [46] 1 month acute   healthy 1 month acute   acute   acute   1 month 1 month
## [55] 6 month acute   acute   acute   healthy acute   healthy acute   acute  
## [64] healthy acute   healthy 1 month 1 month
## Levels: 1 month 6 month acute healthy
## 
## $predicted$`1`
##  [1] 6 month healthy 6 month healthy 1 month acute   healthy 6 month 6 month
## [10] acute   acute   acute   1 month healthy healthy healthy 1 month acute  
## [19] acute   acute   6 month acute   1 month 6 month 1 month 1 month healthy
## [28] healthy healthy healthy healthy acute   1 month healthy acute   acute  
## [37] healthy acute   acute   1 month healthy healthy healthy acute   6 month
## [46] healthy acute   healthy healthy acute   healthy acute   1 month 6 month
## [55] 6 month 1 month healthy acute   healthy acute   healthy 1 month acute  
## [64] healthy acute   acute   6 month acute  
## Levels: 1 month 6 month acute healthy

bestPredicted <- data.frame(predicted=result$predicted[3],type=training$class)
bestPredicted

##        X20    type
## 1    acute   acute
## 2    acute 6 month
## 3    acute 1 month
## 4  healthy healthy
## 5  1 month 1 month
## 6    acute   acute
## 7  1 month 1 month
## 8    acute   acute
## 9  healthy 6 month
## 10 1 month   acute
## 11   acute 1 month
## 12   acute 1 month
## 13 healthy healthy
## 14 6 month 1 month
## 15   acute   acute
## 16 healthy healthy
## 17 healthy 6 month
## 18 1 month 1 month
## 19   acute   acute
## 20 healthy 6 month
## 21 healthy healthy
## 22   acute   acute
## 23 1 month 1 month
## 24   acute   acute
## 25 healthy 1 month
## 26 healthy 1 month
## 27   acute   acute
## 28 healthy healthy
## 29   acute   acute
## 30 healthy   acute
## 31 healthy 1 month
## 32   acute 1 month
## 33 1 month 1 month
## 34 healthy 1 month
## 35   acute 1 month
## 36   acute   acute
## 37 1 month 6 month
## 38   acute   acute
## 39   acute   acute
## 40   acute healthy
## 41 healthy healthy
## 42 healthy healthy
## 43 1 month healthy
## 44 healthy healthy
## 45 healthy   acute
## 46 1 month 1 month
## 47   acute 6 month
## 48 healthy 6 month
## 49 healthy healthy
## 50 healthy 1 month
## 51 healthy 1 month
## 52   acute 6 month
## 53 healthy healthy
## 54 1 month   acute
## 55 1 month healthy
## 56   acute healthy
## 57 1 month   acute
## 58   acute   acute
## 59   acute   acute
## 60   acute 1 month
## 61 healthy   acute
## 62 healthy healthy
## 63   acute   acute
## 64 1 month 6 month
## 65   acute   acute
## 66 healthy healthy
## 67 1 month healthy
## 68   acute   acute

sum(bestPredicted$X20==bestPredicted$type)/length(bestPredicted$X20)

## [1] 0.5147059

This used 20 features and got the least error but results in accuracy are terrible. Now the actual random forest function in the package.

set.seed(123)
ML80.rf <- randomForest(class ~ ., data=training, ntree=10000, keep.forest=TRUE,
                          importance=TRUE)
print(ML80.rf)

## 
## Call:
##  randomForest(formula = class ~ ., data = training, ntree = 10000,      keep.forest = TRUE, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 10000
## No. of variables tried at each split: 8
## 
##         OOB estimate of  error rate: 55.88%
## Confusion matrix:
##         1 month 6 month acute healthy class.error
## 1 month       4       1     9       5   0.7894737
## 6 month       2       0     2       5   1.0000000
## acute         5       1    16       1   0.3043478
## healthy       3       2     2      10   0.4117647

The predictors are best at predicting the acute or healthy cases but not the 6 month at all. These were the top 80

variableImportance <- data.frame(ML80.rf$importance)
variableImportance

##                      X1.month      X6.month         acute       healthy
## NEK6             1.974351e-04 -3.292857e-04  8.552170e-05 -2.851263e-04
## HECTD1          -1.284888e-03  7.248810e-04  4.198399e-05  2.918687e-04
## PDE4DIP          2.285871e-03 -1.536429e-03  1.265333e-02  9.851602e-03
## POU2F2          -5.980098e-04 -6.357143e-04  1.916839e-04 -9.329834e-04
## TMX3             4.191026e-04  1.692619e-03  1.882240e-04 -4.310281e-04
## KAT5             1.726551e-04 -1.708333e-03 -6.633264e-04 -1.354264e-03
## PLCXD1          -2.525278e-04  4.195119e-03  6.713683e-03  1.426802e-02
## MPDU1           -1.024509e-03  5.993571e-03  2.997404e-03  5.952554e-03
## PCSK1N           5.730988e-04 -1.069167e-03  3.408525e-04  9.876154e-04
## TCEAL4           2.320814e-03  9.238095e-04  2.483950e-03 -3.909632e-04
## KRT36           -8.934488e-04 -1.036667e-03  4.749023e-04  4.627706e-05
## HSD11B1         -6.795768e-04 -9.776190e-04 -1.996923e-04 -5.026154e-04
## LOC100287314     8.642424e-04 -1.985238e-03  3.275191e-04  9.191198e-04
## CHGA            -1.352977e-03  7.965714e-03 -3.401071e-05 -8.655700e-04
## STK38L          -8.818975e-04 -1.061667e-03 -4.119408e-04 -3.406746e-04
## BRD3            -7.213731e-04 -7.416667e-05 -5.559645e-04  2.725476e-03
## TMSB4Y          -9.037410e-04  5.086667e-03  5.909507e-05  3.516609e-03
## GGT7            -9.113215e-04 -1.216667e-03 -5.410847e-04 -2.234416e-04
## RNF217           4.488864e-04 -2.040000e-03 -7.838764e-04  7.766703e-04
## ALAS2           -1.188402e-03 -6.626190e-04  1.023624e-03 -6.522319e-04
## LILRA6           2.839207e-03 -1.080000e-03  3.377771e-02  1.424508e-02
## ADRBK2           8.962796e-04 -2.159286e-03  2.883323e-03 -1.064711e-03
## TAS2R3          -1.385606e-04  1.522381e-03 -5.549728e-04 -9.244372e-04
## C1orf51         -8.283270e-04 -7.583333e-04 -6.415321e-04 -9.344877e-04
## CFTR            -4.741234e-04 -7.033333e-04 -5.652926e-04 -1.250794e-04
## HS3ST3B1        -5.901623e-04  3.150000e-04  1.663434e-04  3.888889e-05
## LRRCC1          -5.834443e-06 -8.833333e-04 -4.550325e-04  1.394369e-03
## MPO             -3.530254e-03 -2.317262e-03  6.070642e-03  6.943828e-03
## LOC653562       -9.442713e-04  3.359524e-04 -1.212044e-04 -2.877020e-04
## TPBG            -5.254365e-04 -1.966667e-04 -1.015548e-03 -4.573413e-04
## SCN11A          -5.679762e-04  6.366667e-04 -3.093726e-04  3.642821e-04
## VSIG8           -2.573088e-04 -6.833333e-05 -4.862388e-04 -4.927350e-05
## CD2AP           -5.925214e-05 -1.200952e-03 -6.194797e-04  3.993290e-04
## FLJ23867         9.941295e-04 -2.634286e-03  6.876315e-04  9.727237e-03
## BEGAIN           2.046520e-03  1.032619e-03 -6.999170e-04 -1.260325e-03
## MGC34800         1.626015e-03 -9.261905e-05 -4.369722e-04 -8.397294e-04
## P2RX7           -2.874459e-05  1.012619e-03  8.740327e-03  1.433532e-03
## WDR46           -1.815771e-03 -1.620952e-03  1.600214e-03  5.361223e-03
## GSC             -3.374747e-04  1.760952e-03  1.772078e-04  3.808730e-04
## C3orf42         -5.017205e-05 -3.445238e-04 -4.190820e-04 -4.230483e-04
## MTMR14          -6.035642e-04  2.114762e-03  2.300362e-03 -1.238045e-03
## FBXO48          -6.107157e-04  2.997619e-04 -3.655375e-04  1.099531e-04
## H2AFJ            7.025000e-04  3.464286e-04  2.823888e-03  4.744148e-03
## KRTAP10.7       -7.190249e-04  3.585000e-03 -5.933081e-04 -7.578860e-04
## SNORA74A         6.677861e-04 -2.249405e-03  8.034345e-03  1.314221e-03
## RASAL1          -8.204906e-05 -3.602381e-04 -3.286880e-04 -5.424603e-04
## ENSG00000229196 -1.455999e-03 -2.876190e-04 -2.684527e-04 -2.683983e-04
## FBXO9           -7.120815e-04  5.738095e-05  4.890390e-04  3.212845e-03
## SOX10           -1.261472e-04 -1.383333e-04 -3.349062e-04 -5.091270e-05
## C1orf127        -3.785354e-05 -2.035714e-04 -3.317757e-04 -2.548088e-04
## F13A1           -2.749952e-03  5.108333e-03  7.095081e-03 -1.239899e-04
## OR10A3          -8.127500e-04 -5.433333e-04 -4.744986e-04 -3.907143e-04
## LOC145757        5.150627e-04 -2.045000e-03  1.842774e-04  4.520613e-03
## FAT2            -5.146861e-04  1.788333e-03 -2.044134e-04  7.615909e-04
## OOEP            -1.820346e-05 -8.666667e-04 -1.365008e-03  1.151263e-04
## EIF2C2          -5.535137e-04  4.002381e-04  2.278918e-04  7.535354e-04
## MMP13           -1.906566e-04  7.116667e-04 -4.410728e-05 -5.623918e-04
## MAP2K3          -1.434003e-03  7.590476e-04  1.898859e-03  8.121825e-04
## BRSK1           -1.404646e-03  2.861667e-03  1.299509e-02  1.171913e-02
## RHOBTB2         -7.349939e-04 -1.182619e-03  6.101537e-04  9.940476e-05
## Septin.6         1.609333e-03  1.366667e-04 -2.306087e-04  1.805234e-03
## STRA8            2.506507e-04 -7.066667e-04 -1.186548e-03 -6.760714e-04
## ATP2C1          -1.399531e-04 -1.095952e-03 -4.790132e-04  8.758297e-05
## OR14A16         -7.488614e-04  3.720000e-03 -5.020256e-04 -2.622583e-04
## AHSG             7.165260e-04  1.514762e-03 -1.113413e-03  4.628499e-04
## ORC6L           -1.011624e-04  6.109524e-04  7.262432e-04  3.013564e-03
## TLX1NB          -4.712698e-04 -7.566667e-04 -7.107035e-04  5.088492e-04
## MTHFD2           8.634740e-04  4.579762e-04  2.003276e-03  9.091270e-05
## GMPPB           -7.762060e-04  3.300000e-04  4.077717e-04 -2.629113e-04
## RGPD3           -5.790801e-04 -1.823810e-04 -7.207035e-04 -2.659993e-04
## FXR1            -1.271634e-03 -6.480952e-04  1.074139e-03 -3.271104e-04
## GJA5             7.243193e-04  3.497619e-04  2.181167e-03 -1.640519e-03
## ZNF836           4.431843e-04 -1.108571e-03 -3.985792e-05  2.763492e-04
## C16orf48        -8.427128e-05 -7.935714e-04 -2.499215e-04 -2.623810e-04
## DNAJC24          2.164037e-02 -1.300119e-03  2.093149e-03  7.514390e-03
## TSHR            -1.945666e-03 -2.695238e-04  3.921679e-03 -1.711876e-03
## MTA3            -6.852886e-04  1.571429e-05 -6.220635e-04  2.546501e-04
## NDST3            2.269300e-04 -1.046667e-03 -8.900794e-05 -6.222522e-04
## OR12D2          -4.028608e-04 -9.033333e-04 -6.600358e-04 -8.977886e-04
## MKL1             1.023160e-04  3.166667e-05  1.324773e-03  1.956349e-05
##                 MeanDecreaseAccuracy MeanDecreaseGini
## NEK6                    2.872128e-05        0.4589293
## HECTD1                 -2.174431e-04        0.5157746
## PDE4DIP                 6.777219e-03        1.3999604
## POU2F2                 -3.595395e-04        0.3954239
## TMX3                    3.014424e-04        0.5570351
## KAT5                   -6.509862e-04        0.4162122
## PLCXD1                  5.905106e-03        1.3618037
## MPDU1                   2.763996e-03        0.9856564
## PCSK1N                  3.495545e-04        0.5304417
## TCEAL4                  1.448099e-03        0.5912317
## KRT36                  -1.503262e-04        0.3952479
## HSD11B1                -5.172632e-04        0.3304101
## LOC100287314            3.096906e-04        0.6100534
## CHGA                    2.162620e-04        0.6150855
## STK38L                 -5.331447e-04        0.3985381
## BRD3                    3.302302e-04        0.5888511
## TMSB4Y                  1.039050e-03        0.7175469
## GGT7                   -5.937594e-04        0.3342483
## RNF217                 -2.959888e-04        0.4962428
## ALAS2                  -2.821824e-04        0.5104183
## LILRA6                  1.484886e-02        2.3446198
## ADRBK2                  6.533095e-04        0.5355615
## TAS2R3                 -2.430337e-04        0.4652468
## C1orf51                -7.318249e-04        0.3552346
## CFTR                   -4.469058e-04        0.3247389
## HS3ST3B1               -1.625812e-04        0.4923773
## LRRCC1                  9.067389e-05        0.4370770
## MPO                     2.366932e-03        1.0157770
## LOC653562              -3.804756e-04        0.5558021
## TPBG                   -6.311377e-04        0.3366068
## SCN11A                 -4.738956e-05        0.4383964
## VSIG8                  -2.714597e-04        0.4250916
## CD2AP                  -3.363006e-04        0.4615719
## FLJ23867                2.494598e-03        1.0046261
## BEGAIN                  9.947838e-05        0.5266534
## MGC34800                1.809359e-04        0.5726299
## P2RX7                   3.315751e-03        0.9314206
## WDR46                   1.203960e-03        0.7684234
## GSC                     1.598551e-04        0.5463876
## C3orf42                -2.386917e-04        0.3506272
## MTMR14                  5.728359e-04        0.6699965
## FBXO48                 -2.683464e-04        0.3947342
## H2AFJ                   1.988862e-03        1.0550720
## KRTAP10.7              -1.371929e-04        0.5625267
## SNORA74A                2.814653e-03        0.9362515
## RASAL1                 -2.537410e-04        0.3562299
## ENSG00000229196        -6.112341e-04        0.4269099
## FBXO9                   6.654369e-04        0.6961899
## SOX10                  -1.025804e-04        0.4788161
## C1orf127               -2.302498e-04        0.4154327
## F13A1                   2.110151e-03        1.0254595
## OR10A3                 -4.889625e-04        0.4279342
## LOC145757               9.878939e-04        0.6239428
## FAT2                    1.304458e-04        0.5913211
## OOEP                   -4.986156e-04        0.4146760
## EIF2C2                  2.412391e-04        0.5792137
## MMP13                  -1.667566e-04        0.4245617
## MAP2K3                  5.174084e-04        0.6661249
## BRSK1                   6.941550e-03        1.1518720
## RHOBTB2                -8.820358e-05        0.4956573
## Septin.6                8.963937e-04        0.6698110
## STRA8                  -5.344679e-04        0.4113137
## ATP2C1                 -3.376182e-04        0.4450835
## OR14A16                -1.157674e-04        0.4908333
## AHSG                    3.693337e-05        0.4334393
## ORC6L                   1.043335e-03        0.8923605
## TLX1NB                 -3.270081e-04        0.3943080
## MTHFD2                  1.029491e-03        0.7150768
## GMPPB                  -9.926669e-05        0.3748746
## RGPD3                  -4.544043e-04        0.4032293
## FXR1                   -6.308354e-05        0.5170377
## GJA5                    5.400380e-04        0.6246474
## ZNF836                  9.541820e-05        0.4513524
## C16orf48               -1.354554e-04        0.3652100
## DNAJC24                 7.921279e-03        1.6578556
## TSHR                    3.837562e-04        0.7023384
## MTA3                   -3.749652e-04        0.4890602
## NDST3                  -2.258230e-04        0.3999704
## OR12D2                 -7.141836e-04        0.3299746
## MKL1                    4.188917e-04        0.4796318

Now order by decreasing for the 4 classes and important features.

acuteTop <- variableImportance[order(variableImportance$acute, decreasing=T),]
acuteTop$Gene <- row.names(acuteTop)
head(acuteTop)

##               X1.month     X6.month       acute       healthy
## LILRA6    2.839207e-03 -0.001080000 0.033777711  0.0142450758
## BRSK1    -1.404646e-03  0.002861667 0.012995086  0.0117191270
## PDE4DIP   2.285871e-03 -0.001536429 0.012653335  0.0098516017
## P2RX7    -2.874459e-05  0.001012619 0.008740327  0.0014335317
## SNORA74A  6.677861e-04 -0.002249405 0.008034345  0.0013142208
## F13A1    -2.749952e-03  0.005108333 0.007095081 -0.0001239899
##          MeanDecreaseAccuracy MeanDecreaseGini     Gene
## LILRA6            0.014848864        2.3446198   LILRA6
## BRSK1             0.006941550        1.1518720    BRSK1
## PDE4DIP           0.006777219        1.3999604  PDE4DIP
## P2RX7             0.003315751        0.9314206    P2RX7
## SNORA74A          0.002814653        0.9362515 SNORA74A
## F13A1             0.002110151        1.0254595    F13A1

The above genes are not the genes we found when only scraping top 20 and bottom 20 fold change genes of acute/healthy, month1/healthy, or month6/healthy. This Rstudio run gave the top 5 genes as LILRA6, BRSK1, PDE4DIP, P2RX7, and SNORA74A. This data is using a different fold change value that was there but not used of the acute infection compared to six months after infection.

predictions <- predict(ML80.rf, newdata=testing)
predictions

##       1       3      19      20      24      35      37      39      47      52 
## healthy healthy   acute healthy 1 month   acute   acute   acute   acute   acute 
##      58      59      60      61      63      64      68      78 
## 1 month 1 month 1 month   acute   acute 1 month   acute   acute 
## Levels: 1 month 6 month acute healthy

sum(predictions==testing$class)/length(testing$class)

## [1] 0.6111111

set.seed(123)
modelResult <- data.frame(predictions, type=testing$class)
modelResult

##    predictions    type
## 1      healthy healthy
## 3      healthy healthy
## 19       acute healthy
## 20     healthy healthy
## 24     1 month   acute
## 35       acute   acute
## 37       acute   acute
## 39       acute   acute
## 47       acute   acute
## 52       acute 1 month
## 58     1 month 1 month
## 59     1 month 1 month
## 60     1 month 1 month
## 61       acute 1 month
## 63       acute 1 month
## 64     1 month 1 month
## 68       acute 1 month
## 78       acute 6 month

This model scored 61% accuracy using fold change values and these parameter settings for random forest in the randomForest package of 80% data trained, 20% tested on, and 10k trees with default parameter settings of randomForest function.

Tune a model to find best mtry to randomForest function. The number of mtry parameter is using how many of our 80 features is best for the one out bag calculations. Parameter settings for doBest set to TRUE is to run the forest model with best mtry value found, the stepFactor is to run by this number, and improve parameter is to keep tuning until you cannot get better than the value you set for improve with each run or iteration.

tunedModel <- tuneRF(training[,-81], training[,81], ntreetry= 10000,stepFactor= 1.2
                     , doBest=TRUE, improve=.6)

## mtry = 8  OOB error = 58.82% 
## Searching left ...
## mtry = 7     OOB error = 55.88% 
## 0.05 0.6 
## Searching right ...
## mtry = 9     OOB error = 47.06% 
## 0.2 0.6

set.seed(123)
ML80.rf.tuned <- randomForest(class ~ ., data=training, mtry=9, ntree=10000, keep.forest=TRUE,
                          importance=TRUE)
print(ML80.rf.tuned)

## 
## Call:
##  randomForest(formula = class ~ ., data = training, mtry = 9,      ntree = 10000, keep.forest = TRUE, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 10000
## No. of variables tried at each split: 9
## 
##         OOB estimate of  error rate: 55.88%
## Confusion matrix:
##         1 month 6 month acute healthy class.error
## 1 month       5       1     9       4   0.7368421
## 6 month       2       0     2       5   1.0000000
## acute         5       1    15       2   0.3478261
## healthy       3       2     2      10   0.4117647

importance(ML80.rf.tuned)

##                     1 month     6 month        acute     healthy
## NEK6             3.17844419 -0.20591611 -0.181587507  0.45657905
## HECTD1          -3.23224966  2.01230032 -0.456387976  3.15973725
## PDE4DIP          5.50282801 -4.96272788 19.532684711 14.73682403
## POU2F2          -0.97868881 -1.66621045  1.232438965  0.10915371
## TMX3             1.77677173  3.40399040  1.330505606 -1.66766137
## KAT5             2.13500878 -0.98935876 -1.064391432 -3.34283616
## PLCXD1           0.05115988  2.06338225 12.085639666 20.01562952
## MPDU1           -1.70015009 10.59636397  5.439229817 12.07007164
## PCSK1N           2.32622886 -4.29942757  0.636711116  2.18952169
## TCEAL4           4.41129695  2.00755649  5.759289050 -1.61974637
## KRT36           -2.41543261 -1.79889318  0.972345937 -1.28134488
## HSD11B1         -2.56394381 -1.15265324 -3.357116862 -0.39810103
## LOC100287314     1.89120092 -3.78265392  1.151139943  2.70494631
## CHGA            -3.11532343  8.51944071  0.004472316 -0.86066088
## STK38L          -1.36470741 -0.18506387 -0.054555371 -1.32198785
## BRD3            -2.49293421 -0.00936427  0.641002086  5.61718981
## TMSB4Y          -2.03867654  9.10506911 -2.646282514  5.23276317
## GGT7            -4.16649318 -2.18515925  0.282770347 -3.22123669
## RNF217           1.09015687 -1.40395981 -1.898577108  2.29843030
## ALAS2           -4.05131872 -1.68428741  0.864569347 -2.61258600
## LILRA6           3.00883618 -1.67575108 34.208248136 20.14631723
## ADRBK2           4.31417460 -3.79667201  7.122397903 -1.40847347
## TAS2R3          -2.19917341  2.77215285 -2.118100737 -2.81210034
## C1orf51         -3.35600084 -1.75780038 -2.046965391 -3.22988227
## CFTR            -2.85414406 -0.59670304 -3.688653390 -1.08951558
## HS3ST3B1        -2.33061149 -0.54571099  2.786566431 -0.38925956
## LRRCC1          -0.90377577 -1.89223051 -2.965499292  2.21031475
## MPO             -5.98461589 -3.53479480 11.320137675 13.62988082
## LOC653562       -0.25411489 -0.70623610  0.179256951  0.53852482
## TPBG            -2.32881961 -1.68535854 -3.487047713 -3.82323852
## SCN11A          -0.85873735  2.07357111 -1.192044616  1.94138201
## VSIG8           -3.47982755 -0.26420677 -1.815027755  0.05763974
## CD2AP           -2.50591893 -3.68639654 -0.481841054  0.81889122
## FLJ23867         4.42557672 -4.76616737  0.036046521 14.98017631
## BEGAIN           3.85298948  0.15046666 -2.791367131 -1.45165018
## MGC34800         5.22561779 -1.55677007 -0.872611462 -3.22891960
## P2RX7           -0.76096009  2.63268898 16.948178258  3.52474644
## WDR46           -4.50074546 -2.97246049  4.283060482 11.85131167
## GSC             -4.22936114  2.32756918  0.771601652 -0.55954593
## C3orf42         -0.93613658 -2.21768865 -1.650824455 -3.90974555
## MTMR14          -2.49888385  2.14302169  6.252696884 -2.57447892
## FBXO48          -2.77187454 -1.96255818 -0.219189489 -1.46693994
## H2AFJ           -0.50583933 -0.16171982  5.655937303  9.07740847
## KRTAP10.7       -1.19036069  6.40120355 -2.222688086 -2.77630340
## SNORA74A         3.88576774 -3.69759347 16.149637644  1.09875782
## RASAL1          -2.81177660  0.06366687 -2.933931687 -3.18227380
## ENSG00000229196 -3.96883511 -1.52556777 -2.782718461  0.53452710
## FBXO9           -2.30069809 -1.63171936  0.570024247  6.25378097
## SOX10            1.31806016 -1.35043453 -1.140234952 -2.23496821
## C1orf127         0.55341063 -0.03530793 -2.533537395  0.56839388
## F13A1           -5.88077459  8.30790264 14.119805385  1.50085833
## OR10A3          -0.79076088 -1.32404106  0.576778624  1.68343903
## LOC145757        1.05989949 -4.42575815 -0.114942752 10.14242654
## FAT2            -0.67230521  1.24938167 -1.143370605  1.36840666
## OOEP            -2.17898701 -4.14634944 -5.360137041  1.57056264
## EIF2C2          -0.25343857 -0.50234771 -0.278105830  0.73947703
## MMP13           -4.25949125  2.15163178  1.174678886 -3.31696369
## MAP2K3          -3.69766270  1.80880254  4.358339124  0.28895062
## BRSK1           -1.84613588  4.01896130 19.208748573 15.66402314
## RHOBTB2         -3.47418297 -3.00421646  0.611333970  1.37349661
## Septin.6         1.27441228  0.46150232 -0.311437052  6.62585412
## STRA8           -1.82117373 -1.17639253 -2.184584069 -1.27988366
## ATP2C1          -1.05430677 -3.29612966 -1.276168144  2.26361634
## OR14A16         -0.85530413  5.51835041 -1.521453427 -1.58313532
## AHSG            -0.09273714  4.28090409 -3.507839512 -0.75226232
## ORC6L           -0.43637922  0.19477909  1.758969828  5.69571078
## TLX1NB          -0.29279018 -2.58847157 -2.643178648  0.15947685
## MTHFD2           3.35088428  0.44182961  2.720611326  3.53590199
## GMPPB           -1.63755788  0.68188114  0.148882673 -0.75541725
## RGPD3           -3.14600483 -2.77011872 -2.380442814 -1.96091162
## FXR1            -2.61676762 -3.14271707  3.434424820  1.55321105
## GJA5             1.37297993 -0.08861429  4.572555961 -3.83214182
## ZNF836          -1.04521754 -5.15611373 -2.212134657 -0.31807492
## C16orf48        -1.80484440 -2.21415806 -0.994680410 -2.28364894
## DNAJC24         26.37460352 -0.16654904  3.621126222 12.44330299
## TSHR            -5.76097167  0.01032284 12.562694740 -3.43890403
## MTA3            -2.00615513 -1.86973795 -1.470729555 -0.37929565
## NDST3            0.36789166 -1.73556731 -0.987385495 -0.93164578
## OR12D2          -1.13819608 -2.50780767 -4.146718290 -3.47672020
## MKL1             0.54018706 -0.78513365  4.864803360 -2.05855060
##                 MeanDecreaseAccuracy MeanDecreaseGini
## NEK6                      1.16823440        0.4351816
## HECTD1                    0.47489509        0.4673643
## PDE4DIP                  20.22962331        1.5013514
## POU2F2                   -0.73716890        0.3741187
## TMX3                      2.18030576        0.6238894
## KAT5                     -1.37198197        0.4208338
## PLCXD1                   18.14244752        1.4557262
## MPDU1                    10.58400672        1.0063831
## PCSK1N                    1.29183550        0.5486337
## TCEAL4                    5.32258633        0.6060380
## KRT36                    -1.97340893        0.3672726
## HSD11B1                  -3.52434651        0.3137500
## LOC100287314              1.44617277        0.5906007
## CHGA                      1.75993366        0.6349335
## STK38L                   -1.31905098        0.3908343
## BRD3                      2.25013512        0.5905398
## TMSB4Y                    3.61988139        0.7394827
## GGT7                     -4.38632089        0.3097573
## RNF217                   -0.01196054        0.4795363
## ALAS2                    -2.94598963        0.4973330
## LILRA6                   32.94076878        2.4764896
## ADRBK2                    6.01364213        0.5134961
## TAS2R3                   -2.58852148        0.4461379
## C1orf51                  -5.25365014        0.3395588
## CFTR                     -4.56412097        0.3321699
## HS3ST3B1                 -0.60373500        0.4847396
## LRRCC1                   -2.04360330        0.4147686
## MPO                       9.48435629        1.0529854
## LOC653562                 0.03313130        0.5408874
## TPBG                     -5.32252139        0.3198409
## SCN11A                    0.50478459        0.4078703
## VSIG8                    -2.96955571        0.4326255
## CD2AP                    -2.90556360        0.4619491
## FLJ23867                  9.29083843        0.9600182
## BEGAIN                    0.06751539        0.5347422
## MGC34800                  0.80578677        0.5620145
## P2RX7                    13.88018035        0.9448540
## WDR46                     6.18230764        0.7482990
## GSC                      -1.33267261        0.5633950
## C3orf42                  -3.70206883        0.3191279
## MTMR14                    1.86662049        0.6750362
## FBXO48                   -2.91835665        0.3779377
## H2AFJ                     7.56750363        1.0969368
## KRTAP10.7                -1.31744201        0.5456194
## SNORA74A                 11.27666348        0.8980419
## RASAL1                   -4.88741602        0.3703968
## ENSG00000229196          -3.76988346        0.4109683
## FBXO9                     2.46258369        0.6969781
## SOX10                    -1.21349086        0.4607025
## C1orf127                 -0.94672243        0.4210986
## F13A1                     9.71149499        1.0620147
## OR10A3                    0.54963710        0.4284370
## LOC145757                 4.71883266        0.6176924
## FAT2                     -0.05211871        0.5636897
## OOEP                     -4.73836558        0.3935767
## EIF2C2                    0.09635417        0.6009054
## MMP13                    -2.06842161        0.4133290
## MAP2K3                    1.04416808        0.6763739
## BRSK1                    19.87439214        1.1816095
## RHOBTB2                  -1.78224177        0.4718620
## Septin.6                  4.12097645        0.6607393
## STRA8                    -3.50343517        0.4091857
## ATP2C1                   -1.54121365        0.4234493
## OR14A16                  -0.15282018        0.5077850
## AHSG                     -1.16233381        0.4118094
## ORC6L                     4.02713563        0.9164653
## TLX1NB                   -2.23597922        0.3893407
## MTHFD2                    5.32288040        0.7195708
## GMPPB                    -0.81274239        0.3901742
## RGPD3                    -4.53826166        0.4156880
## FXR1                      1.28468334        0.4801976
## GJA5                      1.93215376        0.5839072
## ZNF836                   -3.52868805        0.4453706
## C16orf48                 -3.51418085        0.3405220
## DNAJC24                  24.51004801        1.7768431
## TSHR                      4.01876485        0.6547802
## MTA3                     -2.19073845        0.4818738
## NDST3                    -1.63901814        0.3787438
## OR12D2                   -5.69861114        0.3140389
## MKL1                      1.65277453        0.4683902

predictions1 <- predict(ML80.rf.tuned, newdata=testing)
predictions1

##       1       3      19      20      24      35      37      39      47      52 
## healthy healthy 1 month healthy 1 month   acute   acute   acute   acute   acute 
##      58      59      60      61      63      64      68      78 
## 1 month 1 month 1 month   acute   acute 1 month   acute   acute 
## Levels: 1 month 6 month acute healthy

prediction1_df <- data.frame(predictions1,type=testing$class)
prediction1_df

##    predictions1    type
## 1       healthy healthy
## 3       healthy healthy
## 19      1 month healthy
## 20      healthy healthy
## 24      1 month   acute
## 35        acute   acute
## 37        acute   acute
## 39        acute   acute
## 47        acute   acute
## 52        acute 1 month
## 58      1 month 1 month
## 59      1 month 1 month
## 60      1 month 1 month
## 61        acute 1 month
## 63        acute 1 month
## 64      1 month 1 month
## 68        acute 1 month
## 78        acute 6 month

sum(prediction1_df$predictions1==prediction1_df$type)/length(prediction1_df$predictions1)

## [1] 0.6111111

61% accurate

After tuning the model, the results were the same as the first run with default parameters of 61% accuracy in prediction. Probably because the data was limited in class type to begin with. So we can just try the fold change values to see if only keeping the acute and chronic cases would give better results. Two classes should have better odds than four classes, but again there are the least amount of chronic classes and our model inaccurately predicted any of the chronic cases even though these are the top gene expression genes in fold change values of acute to chronic infection.

Lets just see the imbalance again.

ML80 %>% group_by(class) %>% count(class)

## # A tibble: 4 × 2
## # Groups:   class [4]
##   class       n
##   <fct>   <int>
## 1 1 month    27
## 2 6 month    10
## 3 acute      28
## 4 healthy    21

So, we are to remove the 1 month class of 27 samples and the healthy 21 samples. We are left with 10 chronic samples and 28 acute samples.

ML80_onlyAcute6month <- subset(ML80, ML80$class=="acute" | ML80$class=='6 month')
ML80_onlyAcute6month$class

##  [1] acute   acute   acute   acute   acute   acute   acute   acute   acute  
## [10] acute   acute   acute   acute   acute   acute   acute   acute   acute  
## [19] acute   acute   acute   acute   acute   acute   acute   acute   acute  
## [28] acute   6 month 6 month 6 month 6 month 6 month 6 month 6 month 6 month
## [37] 6 month 6 month
## Levels: 1 month 6 month acute healthy

ML80_onlyAcute6month %>% group_by(class) %>% count(class)

## # A tibble: 2 × 2
## # Groups:   class [2]
##   class       n
##   <fct>   <int>
## 1 6 month    10
## 2 acute      28

There are 10 samples of 6 month and 28 of acute. Lets get rolling with our training and testing sets and then the randomForest model tuning and building.

set.seed(123)

intrain <- sample(1:38, .8*38)
training <- ML80_onlyAcute6month[intrain,]
testing <- ML80_onlyAcute6month[-intrain,]

training %>% group_by(class) %>% count(class)

## # A tibble: 2 × 2
## # Groups:   class [2]
##   class       n
##   <fct>   <int>
## 1 6 month     9
## 2 acute      21

The model is building a classifier off 9 observations of chronic cases and 21 acute cases.

testing %>% group_by(class) %>% count(class)

## # A tibble: 2 × 2
## # Groups:   class [2]
##   class       n
##   <fct>   <int>
## 1 6 month     1
## 2 acute       7

The testing class will have 1 class of chronic and 7 cases of acute to predict accurately.

Lets see how this goes. Lets tune first for best mtry to use then use the randomForest function. run or iteration.

# tunedModel1 <- tuneRF(training[,1:80], training[,-81], ntreetry= 10000,stepFactor= .5
#                      , doBest=TRUE, improve=.5)

“Error in randomForest.default(x, y, mtry = mtryStart, ntree = ntreeTry, : length of response must be the same as predictors”

The above tuning won’t work, must be because of the limitation on classes in the target only having 1 chronic case compared to 7. So we move on to the random forest modeling.

# set.seed(123)
# ML80.rf.tuned <- randomForest(class ~ ., data=training, mtry=9, ntree=10000, keep.forest=TRUE,
#                           importance=TRUE)
# print(ML80.rf.tuned)

“Error in randomForest.default(m, y, …) : Can’t have empty classes in y.” Maybe a problem with factorization as the str() will probably give some information, should only have 2 classes but showed 4.

str(ML80_onlyAcute6month$class)

##  Factor w/ 4 levels "1 month","6 month",..: 3 3 3 3 3 3 3 3 3 3 ...

So we have a problem, there are only 2 classes from our removal of two and yet the string attaches those factors. Lets try changing it to character and back to factor.

ML80_onlyAcute6month$class <- as.character(ML80_onlyAcute6month$class)
str(ML80_onlyAcute6month$class)

##  chr [1:38] "acute" "acute" "acute" "acute" "acute" "acute" "acute" "acute" ...

ML80_onlyAcute6month$class

##  [1] "acute"   "acute"   "acute"   "acute"   "acute"   "acute"   "acute"  
##  [8] "acute"   "acute"   "acute"   "acute"   "acute"   "acute"   "acute"  
## [15] "acute"   "acute"   "acute"   "acute"   "acute"   "acute"   "acute"  
## [22] "acute"   "acute"   "acute"   "acute"   "acute"   "acute"   "acute"  
## [29] "6 month" "6 month" "6 month" "6 month" "6 month" "6 month" "6 month"
## [36] "6 month" "6 month" "6 month"

There are only 2 character entries, now change back to factors and see if valid.

ML80_onlyAcute6month$class <- as.factor(ML80_onlyAcute6month$class)
str(ML80_onlyAcute6month$class)

##  Factor w/ 2 levels "6 month","acute": 2 2 2 2 2 2 2 2 2 2 ...

That seemed to work, so we move forward with tuning and modeling.

training <- ML80_onlyAcute6month[intrain,]
testing <- ML80_onlyAcute6month[-intrain,]
tunedModel1 <- tuneRF(training[,1:80], training$class, ntreetry= 10000,stepFactor= .5
                     , doBest=TRUE, improve=.5)

## mtry = 8  OOB error = 30% 
## Searching left ...
## mtry = 16    OOB error = 20% 
## 0.3333333 0.5 
## Searching right ...
## mtry = 4     OOB error = 33.33% 
## -0.1111111 0.5

The above says less than 24% error in out of bag samples with at least 8 features of mtry. So making the model with this number now.

set.seed(123)
ML80.rf.tuned1 <- randomForest(class ~ ., data=training, mtry=8, ntree=10000, keep.forest=TRUE,
                          importance=TRUE)
print(ML80.rf.tuned1)

## 
## Call:
##  randomForest(formula = class ~ ., data = training, mtry = 8,      ntree = 10000, keep.forest = TRUE, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 10000
## No. of variables tried at each split: 8
## 
##         OOB estimate of  error rate: 26.67%
## Confusion matrix:
##         6 month acute class.error
## 6 month       2     7  0.77777778
## acute         1    20  0.04761905

The model did very well on the acute cases and actually predicted a few of the 6 month cases. Lets see how it predicts on the testing set of unseen data.

predictions2 <- predict(ML80.rf.tuned1, newdata=testing)
predictions2

##    23    27    34    37    39    44    45    85 
## acute acute acute acute acute acute acute acute 
## Levels: 6 month acute

model2classes <- data.frame(predictions2,type=testing$class)
model2classes

##    predictions2    type
## 23        acute   acute
## 27        acute   acute
## 34        acute   acute
## 37        acute   acute
## 39        acute   acute
## 44        acute   acute
## 45        acute   acute
## 85        acute 6 month

All acute classes were predicted correctly but the only 6 month class was incorrectly predicted to be acute. That is a good introduction and data science project in understanding and reviewing how little nuances in factor attachment and classification with fewer classes as well as exploration of the randomForest package. Clearly the accuracy is much better but lets just see the accuracy measure for ourselves.

sum(model2classes$predictions2==model2classes$type)/length(model2classes$predictions2)

## [1] 0.875

Much better accuracy, so wonderful. Alright, well thanks so much for reading this exploration and data science hands on assignment.

big28K dataframe LymeDisease

Janis Harris

2026-01-03