Trying out the randomForest package without the caret package to see how it would do on the lyme disease data that we used earlier that had an imbalanced number of classes. You can read about the randomForest package at this link.
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:randomForest':
##
## combine
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
Make a big data of the lyme disease data to test out some functions in the randomForest package.
lyme <- read.csv('lymedisease_meanEqualsMedian99vars.csv', header=T, na.strings=c('',' ','na','NA'))
colnames(lyme)
## [1] "Gene"
## [2] "healthyControl_1"
## [3] "healthyControl_2"
## [4] "healthyControl_3"
## [5] "healthyControl_4"
## [6] "healthyControl_5"
## [7] "healthyControl_6"
## [8] "healthyControl_7"
## [9] "healthyControl_8"
## [10] "healthyControl_9"
## [11] "healthyControl_10"
## [12] "healthyControl_11"
## [13] "healthyControl_12"
## [14] "healthyControl_13"
## [15] "healthyControl_14"
## [16] "healthyControl_15"
## [17] "healthyControl_16"
## [18] "healthyControl_17"
## [19] "healthyControl_18"
## [20] "healthyControl_19"
## [21] "healthyControl_20"
## [22] "healthyControl_21"
## [23] "acuteLymeDisease_1"
## [24] "acuteLymeDisease_2"
## [25] "acuteLymeDisease_3"
## [26] "acuteLymeDisease_4"
## [27] "acuteLymeDisease_5"
## [28] "acuteLymeDisease_6"
## [29] "acuteLymeDisease_7"
## [30] "acuteLymeDisease_8"
## [31] "acuteLymeDisease_9"
## [32] "acuteLymeDisease_10"
## [33] "acuteLymeDisease_11"
## [34] "acuteLymeDisease_12"
## [35] "acuteLymeDisease_13"
## [36] "acuteLymeDisease_14"
## [37] "acuteLymeDisease_15"
## [38] "acuteLymeDisease_16"
## [39] "acuteLymeDisease_17"
## [40] "acuteLymeDisease_18"
## [41] "acuteLymeDisease_19"
## [42] "acuteLymeDisease_20"
## [43] "acuteLymeDisease_21"
## [44] "acuteLymeDisease_22"
## [45] "acuteLymeDisease_23"
## [46] "acuteLymeDisease_24"
## [47] "acuteLymeDisease_25"
## [48] "acuteLymeDisease_26"
## [49] "acuteLymeDisease_27"
## [50] "acuteLymeDisease_28"
## [51] "Antibodies_1month_1"
## [52] "Antibodies_1month_2"
## [53] "Antibodies_1month_3"
## [54] "Antibodies_1month_4"
## [55] "Antibodies_1month_5"
## [56] "Antibodies_1month_6"
## [57] "Antibodies_1month_7"
## [58] "Antibodies_1month_8"
## [59] "Antibodies_1month_9"
## [60] "Antibodies_1month_10"
## [61] "Antibodies_1month_11"
## [62] "Antibodies_1month_12"
## [63] "Antibodies_1month_13"
## [64] "Antibodies_1month_14"
## [65] "Antibodies_1month_15"
## [66] "Antibodies_1month_16"
## [67] "Antibodies_1month_17"
## [68] "Antibodies_1month_18"
## [69] "Antibodies_1month_19"
## [70] "Antibodies_1month_20"
## [71] "Antibodies_1month_21"
## [72] "Antibodies_1month_22"
## [73] "Antibodies_1month_23"
## [74] "Antibodies_1month_24"
## [75] "Antibodies_1month_25"
## [76] "Antibodies_1month_26"
## [77] "Antibodies_1month_27"
## [78] "Antibodies_6months_1"
## [79] "Antibodies_6months_2"
## [80] "Antibodies_6months_3"
## [81] "Antibodies_6months_4"
## [82] "Antibodies_6months_5"
## [83] "Antibodies_6months_6"
## [84] "Antibodies_6months_7"
## [85] "Antibodies_6months_8"
## [86] "Antibodies_6months_9"
## [87] "Antibodies_6months_10"
## [88] "healthy_Mean"
## [89] "acuteLymeDisease_Mean"
## [90] "antibodies_1month_Mean"
## [91] "antibodies_6month_Mean"
## [92] "acuteHealthy_foldChange"
## [93] "antibodies_1month_healthy_foldChange"
## [94] "antibodies_6month_healthy_foldchange"
## [95] "acute_vs_6month_foldchange"
## [96] "healthy_Median"
## [97] "acuteLymeDisease_Median"
## [98] "antibodies_1month_Median"
## [99] "antibodies_6month_Median"
Locate the file we just used here.
lymeClassOnly <- lyme[,1:87]
colnames(lymeClassOnly)
## [1] "Gene" "healthyControl_1" "healthyControl_2"
## [4] "healthyControl_3" "healthyControl_4" "healthyControl_5"
## [7] "healthyControl_6" "healthyControl_7" "healthyControl_8"
## [10] "healthyControl_9" "healthyControl_10" "healthyControl_11"
## [13] "healthyControl_12" "healthyControl_13" "healthyControl_14"
## [16] "healthyControl_15" "healthyControl_16" "healthyControl_17"
## [19] "healthyControl_18" "healthyControl_19" "healthyControl_20"
## [22] "healthyControl_21" "acuteLymeDisease_1" "acuteLymeDisease_2"
## [25] "acuteLymeDisease_3" "acuteLymeDisease_4" "acuteLymeDisease_5"
## [28] "acuteLymeDisease_6" "acuteLymeDisease_7" "acuteLymeDisease_8"
## [31] "acuteLymeDisease_9" "acuteLymeDisease_10" "acuteLymeDisease_11"
## [34] "acuteLymeDisease_12" "acuteLymeDisease_13" "acuteLymeDisease_14"
## [37] "acuteLymeDisease_15" "acuteLymeDisease_16" "acuteLymeDisease_17"
## [40] "acuteLymeDisease_18" "acuteLymeDisease_19" "acuteLymeDisease_20"
## [43] "acuteLymeDisease_21" "acuteLymeDisease_22" "acuteLymeDisease_23"
## [46] "acuteLymeDisease_24" "acuteLymeDisease_25" "acuteLymeDisease_26"
## [49] "acuteLymeDisease_27" "acuteLymeDisease_28" "Antibodies_1month_1"
## [52] "Antibodies_1month_2" "Antibodies_1month_3" "Antibodies_1month_4"
## [55] "Antibodies_1month_5" "Antibodies_1month_6" "Antibodies_1month_7"
## [58] "Antibodies_1month_8" "Antibodies_1month_9" "Antibodies_1month_10"
## [61] "Antibodies_1month_11" "Antibodies_1month_12" "Antibodies_1month_13"
## [64] "Antibodies_1month_14" "Antibodies_1month_15" "Antibodies_1month_16"
## [67] "Antibodies_1month_17" "Antibodies_1month_18" "Antibodies_1month_19"
## [70] "Antibodies_1month_20" "Antibodies_1month_21" "Antibodies_1month_22"
## [73] "Antibodies_1month_23" "Antibodies_1month_24" "Antibodies_1month_25"
## [76] "Antibodies_1month_26" "Antibodies_1month_27" "Antibodies_6months_1"
## [79] "Antibodies_6months_2" "Antibodies_6months_3" "Antibodies_6months_4"
## [82] "Antibodies_6months_5" "Antibodies_6months_6" "Antibodies_6months_7"
## [85] "Antibodies_6months_8" "Antibodies_6months_9" "Antibodies_6months_10"
classNames <- colnames(lymeClassOnly)
classNames
## [1] "Gene" "healthyControl_1" "healthyControl_2"
## [4] "healthyControl_3" "healthyControl_4" "healthyControl_5"
## [7] "healthyControl_6" "healthyControl_7" "healthyControl_8"
## [10] "healthyControl_9" "healthyControl_10" "healthyControl_11"
## [13] "healthyControl_12" "healthyControl_13" "healthyControl_14"
## [16] "healthyControl_15" "healthyControl_16" "healthyControl_17"
## [19] "healthyControl_18" "healthyControl_19" "healthyControl_20"
## [22] "healthyControl_21" "acuteLymeDisease_1" "acuteLymeDisease_2"
## [25] "acuteLymeDisease_3" "acuteLymeDisease_4" "acuteLymeDisease_5"
## [28] "acuteLymeDisease_6" "acuteLymeDisease_7" "acuteLymeDisease_8"
## [31] "acuteLymeDisease_9" "acuteLymeDisease_10" "acuteLymeDisease_11"
## [34] "acuteLymeDisease_12" "acuteLymeDisease_13" "acuteLymeDisease_14"
## [37] "acuteLymeDisease_15" "acuteLymeDisease_16" "acuteLymeDisease_17"
## [40] "acuteLymeDisease_18" "acuteLymeDisease_19" "acuteLymeDisease_20"
## [43] "acuteLymeDisease_21" "acuteLymeDisease_22" "acuteLymeDisease_23"
## [46] "acuteLymeDisease_24" "acuteLymeDisease_25" "acuteLymeDisease_26"
## [49] "acuteLymeDisease_27" "acuteLymeDisease_28" "Antibodies_1month_1"
## [52] "Antibodies_1month_2" "Antibodies_1month_3" "Antibodies_1month_4"
## [55] "Antibodies_1month_5" "Antibodies_1month_6" "Antibodies_1month_7"
## [58] "Antibodies_1month_8" "Antibodies_1month_9" "Antibodies_1month_10"
## [61] "Antibodies_1month_11" "Antibodies_1month_12" "Antibodies_1month_13"
## [64] "Antibodies_1month_14" "Antibodies_1month_15" "Antibodies_1month_16"
## [67] "Antibodies_1month_17" "Antibodies_1month_18" "Antibodies_1month_19"
## [70] "Antibodies_1month_20" "Antibodies_1month_21" "Antibodies_1month_22"
## [73] "Antibodies_1month_23" "Antibodies_1month_24" "Antibodies_1month_25"
## [76] "Antibodies_1month_26" "Antibodies_1month_27" "Antibodies_6months_1"
## [79] "Antibodies_6months_2" "Antibodies_6months_3" "Antibodies_6months_4"
## [82] "Antibodies_6months_5" "Antibodies_6months_6" "Antibodies_6months_7"
## [85] "Antibodies_6months_8" "Antibodies_6months_9" "Antibodies_6months_10"
healthyIndex <- grep('healthy',classNames)
acute <- grep('acute', classNames)
month1 <- grep('1month',classNames)
month6 <- grep('6month', classNames)
classNames[healthyIndex] <- 'healthy'
classNames[acute] <- 'acute'
classNames[month1] <- '1 month'
classNames[month6] <- '6 month'
classNames
## [1] "Gene" "healthy" "healthy" "healthy" "healthy" "healthy" "healthy"
## [8] "healthy" "healthy" "healthy" "healthy" "healthy" "healthy" "healthy"
## [15] "healthy" "healthy" "healthy" "healthy" "healthy" "healthy" "healthy"
## [22] "healthy" "acute" "acute" "acute" "acute" "acute" "acute"
## [29] "acute" "acute" "acute" "acute" "acute" "acute" "acute"
## [36] "acute" "acute" "acute" "acute" "acute" "acute" "acute"
## [43] "acute" "acute" "acute" "acute" "acute" "acute" "acute"
## [50] "acute" "1 month" "1 month" "1 month" "1 month" "1 month" "1 month"
## [57] "1 month" "1 month" "1 month" "1 month" "1 month" "1 month" "1 month"
## [64] "1 month" "1 month" "1 month" "1 month" "1 month" "1 month" "1 month"
## [71] "1 month" "1 month" "1 month" "1 month" "1 month" "1 month" "1 month"
## [78] "6 month" "6 month" "6 month" "6 month" "6 month" "6 month" "6 month"
## [85] "6 month" "6 month" "6 month"
lymeByClass <- lymeClassOnly
colnames(lymeByClass) <- classNames
colnames(lymeByClass)
## [1] "Gene" "healthy" "healthy" "healthy" "healthy" "healthy" "healthy"
## [8] "healthy" "healthy" "healthy" "healthy" "healthy" "healthy" "healthy"
## [15] "healthy" "healthy" "healthy" "healthy" "healthy" "healthy" "healthy"
## [22] "healthy" "acute" "acute" "acute" "acute" "acute" "acute"
## [29] "acute" "acute" "acute" "acute" "acute" "acute" "acute"
## [36] "acute" "acute" "acute" "acute" "acute" "acute" "acute"
## [43] "acute" "acute" "acute" "acute" "acute" "acute" "acute"
## [50] "acute" "1 month" "1 month" "1 month" "1 month" "1 month" "1 month"
## [57] "1 month" "1 month" "1 month" "1 month" "1 month" "1 month" "1 month"
## [64] "1 month" "1 month" "1 month" "1 month" "1 month" "1 month" "1 month"
## [71] "1 month" "1 month" "1 month" "1 month" "1 month" "1 month" "1 month"
## [78] "6 month" "6 month" "6 month" "6 month" "6 month" "6 month" "6 month"
## [85] "6 month" "6 month" "6 month"
lymeByClass1 <- lymeByClass[complete.cases(lymeByClass),]
dim(lymeByClass1)
## [1] 19525 87
Only one case wasn’t complete so it was removed from the genes.
Make all genes as features as a very wide or fat dataframe that can really slow or shut down some processes.
big20kDF <- data.frame(t(lymeByClass))
#head(big20kDF) #this printed out a very long document had to remove it
big20kDF[1:5,1:5]
## X1 X2 X3 X4 X5
## Gene A1BG A1CF A2BP1 A2LD1 A2M
## healthy -1.517870e-01 1.005733e-02 1.748806e-01 -3.287320e-01 -4.190273e-01
## healthy.1 -0.1186983600 -0.4603165350 0.1099771518 -0.2406292000 0.4414801600
## healthy.2 0.0801501300 0.4677242000 0.5472002982 -0.2155766500 1.1160240000
## healthy.3 -3.132427e-01 1.544443e-01 -1.674776e-01 -3.081846e-01 1.927662e-01
colnames(big20kDF) <-lymeByClass$Gene
Big20K_df <- big20kDF[-1,]
#head(Big20K_df)
Big20K_df[1:5,1:5]
## A1BG A1CF A2BP1 A2LD1 A2M
## healthy -1.517870e-01 1.005733e-02 1.748806e-01 -3.287320e-01 -4.190273e-01
## healthy.1 -0.1186983600 -0.4603165350 0.1099771518 -0.2406292000 0.4414801600
## healthy.2 0.0801501300 0.4677242000 0.5472002982 -0.2155766500 1.1160240000
## healthy.3 -3.132427e-01 1.544443e-01 -1.674776e-01 -3.081846e-01 1.927662e-01
## healthy.4 -4.757977e-02 -6.056571e-02 -1.134319e-01 -1.514206e-01 1.730504e-01
dim(Big20K_df)
## [1] 86 19526
Big20K_df$class <- classNames[2:87]
Big20K_df[1:5,1:5]
## A1BG A1CF A2BP1 A2LD1 A2M
## healthy -1.517870e-01 1.005733e-02 1.748806e-01 -3.287320e-01 -4.190273e-01
## healthy.1 -0.1186983600 -0.4603165350 0.1099771518 -0.2406292000 0.4414801600
## healthy.2 0.0801501300 0.4677242000 0.5472002982 -0.2155766500 1.1160240000
## healthy.3 -3.132427e-01 1.544443e-01 -1.674776e-01 -3.081846e-01 1.927662e-01
## healthy.4 -4.757977e-02 -6.056571e-02 -1.134319e-01 -1.514206e-01 1.730504e-01
write.csv(Big20K_df,'fat20k_Lyme_DF.csv', row.names=F)
Google sheets said file is too large to upload. Was able to write it out but it failed on a ‘function’ mode when reading in.
We will just try one run of the randomForest functions to see if it can pull up most important features with this very large data frame.
acute6monthFC <- lyme[order(lyme$acute_vs_6month_foldchange, decreasing=T),]
colnames(acute6monthFC)
## [1] "Gene"
## [2] "healthyControl_1"
## [3] "healthyControl_2"
## [4] "healthyControl_3"
## [5] "healthyControl_4"
## [6] "healthyControl_5"
## [7] "healthyControl_6"
## [8] "healthyControl_7"
## [9] "healthyControl_8"
## [10] "healthyControl_9"
## [11] "healthyControl_10"
## [12] "healthyControl_11"
## [13] "healthyControl_12"
## [14] "healthyControl_13"
## [15] "healthyControl_14"
## [16] "healthyControl_15"
## [17] "healthyControl_16"
## [18] "healthyControl_17"
## [19] "healthyControl_18"
## [20] "healthyControl_19"
## [21] "healthyControl_20"
## [22] "healthyControl_21"
## [23] "acuteLymeDisease_1"
## [24] "acuteLymeDisease_2"
## [25] "acuteLymeDisease_3"
## [26] "acuteLymeDisease_4"
## [27] "acuteLymeDisease_5"
## [28] "acuteLymeDisease_6"
## [29] "acuteLymeDisease_7"
## [30] "acuteLymeDisease_8"
## [31] "acuteLymeDisease_9"
## [32] "acuteLymeDisease_10"
## [33] "acuteLymeDisease_11"
## [34] "acuteLymeDisease_12"
## [35] "acuteLymeDisease_13"
## [36] "acuteLymeDisease_14"
## [37] "acuteLymeDisease_15"
## [38] "acuteLymeDisease_16"
## [39] "acuteLymeDisease_17"
## [40] "acuteLymeDisease_18"
## [41] "acuteLymeDisease_19"
## [42] "acuteLymeDisease_20"
## [43] "acuteLymeDisease_21"
## [44] "acuteLymeDisease_22"
## [45] "acuteLymeDisease_23"
## [46] "acuteLymeDisease_24"
## [47] "acuteLymeDisease_25"
## [48] "acuteLymeDisease_26"
## [49] "acuteLymeDisease_27"
## [50] "acuteLymeDisease_28"
## [51] "Antibodies_1month_1"
## [52] "Antibodies_1month_2"
## [53] "Antibodies_1month_3"
## [54] "Antibodies_1month_4"
## [55] "Antibodies_1month_5"
## [56] "Antibodies_1month_6"
## [57] "Antibodies_1month_7"
## [58] "Antibodies_1month_8"
## [59] "Antibodies_1month_9"
## [60] "Antibodies_1month_10"
## [61] "Antibodies_1month_11"
## [62] "Antibodies_1month_12"
## [63] "Antibodies_1month_13"
## [64] "Antibodies_1month_14"
## [65] "Antibodies_1month_15"
## [66] "Antibodies_1month_16"
## [67] "Antibodies_1month_17"
## [68] "Antibodies_1month_18"
## [69] "Antibodies_1month_19"
## [70] "Antibodies_1month_20"
## [71] "Antibodies_1month_21"
## [72] "Antibodies_1month_22"
## [73] "Antibodies_1month_23"
## [74] "Antibodies_1month_24"
## [75] "Antibodies_1month_25"
## [76] "Antibodies_1month_26"
## [77] "Antibodies_1month_27"
## [78] "Antibodies_6months_1"
## [79] "Antibodies_6months_2"
## [80] "Antibodies_6months_3"
## [81] "Antibodies_6months_4"
## [82] "Antibodies_6months_5"
## [83] "Antibodies_6months_6"
## [84] "Antibodies_6months_7"
## [85] "Antibodies_6months_8"
## [86] "Antibodies_6months_9"
## [87] "Antibodies_6months_10"
## [88] "healthy_Mean"
## [89] "acuteLymeDisease_Mean"
## [90] "antibodies_1month_Mean"
## [91] "antibodies_6month_Mean"
## [92] "acuteHealthy_foldChange"
## [93] "antibodies_1month_healthy_foldChange"
## [94] "antibodies_6month_healthy_foldchange"
## [95] "acute_vs_6month_foldchange"
## [96] "healthy_Median"
## [97] "acuteLymeDisease_Median"
## [98] "antibodies_1month_Median"
## [99] "antibodies_6month_Median"
Take top40 and bottom 40 genes arrange from fold change values of the acute samples vs the samples already infected 6 months.
acuteVs6monthTop40 <- acute6monthFC[c(1:40,19487:19526),c(1:87,95)]
head(acuteVs6monthTop40$acute_vs_6month_foldchange,40)
## [1] 39592.5065 26949.4605 3013.6402 2340.4274 2252.7837 1667.6541
## [7] 1667.5040 1390.4131 1087.7278 959.0077 955.0899 819.7076
## [13] 814.2625 784.6073 757.5879 687.6496 660.1728 636.0464
## [19] 592.8499 579.0681 568.3394 556.5235 507.4635 502.1060
## [25] 415.4412 401.5336 385.7999 380.0250 376.6654 320.0938
## [31] 317.9930 313.9046 245.4455 214.5356 203.8399 200.7999
## [37] 192.8368 184.9654 180.9991 175.9187
tail(acuteVs6monthTop40$acute_vs_6month_foldchange,40)
## [1] -138.4930 -147.4067 -147.4130 -148.3167 -157.0114 -157.8041
## [7] -159.4937 -165.7433 -172.9144 -178.5105 -188.8205 -197.4117
## [13] -197.8417 -219.1352 -220.8985 -221.6484 -225.3240 -235.9024
## [19] -258.8922 -260.3979 -268.6351 -312.2450 -377.6045 -423.9474
## [25] -449.6177 -496.0992 -543.2492 -583.6196 -589.4268 -732.5246
## [31] -818.2834 -1111.6293 -1166.7017 -1202.1031 -1208.8211 -1363.4545
## [37] -1473.4662 -3152.9187 -4837.2160 -5134.0986
acute6monthDF <- acuteVs6monthTop40[,-88]
genes80 <- acute6monthDF$Gene
genes80
## [1] "NEK6" "HECTD1" "PDE4DIP" "POU2F2"
## [5] "TMX3" "KAT5" "PLCXD1" "MPDU1"
## [9] "PCSK1N" "TCEAL4" "KRT36" "HSD11B1"
## [13] "LOC100287314" "CHGA" "STK38L" "BRD3"
## [17] "TMSB4Y" "GGT7" "RNF217" "ALAS2"
## [21] "LILRA6" "ADRBK2" "TAS2R3" "C1orf51"
## [25] "CFTR" "HS3ST3B1" "LRRCC1" "MPO"
## [29] "LOC653562" "TPBG" "SCN11A" "VSIG8"
## [33] "CD2AP" "FLJ23867" "BEGAIN" "MGC34800"
## [37] "P2RX7" "WDR46" "GSC" "C3orf42"
## [41] "MTMR14" "FBXO48" "H2AFJ" "KRTAP10-7"
## [45] "SNORA74A" "RASAL1" "ENSG00000229196" "FBXO9"
## [49] "SOX10" "C1orf127" "F13A1" "OR10A3"
## [53] "LOC145757" "FAT2" "OOEP" "EIF2C2"
## [57] "MMP13" "MAP2K3" "BRSK1" "RHOBTB2"
## [61] "Septin 6" "STRA8" "ATP2C1" "OR14A16"
## [65] "AHSG" "ORC6L" "TLX1NB" "MTHFD2"
## [69] "GMPPB" "RGPD3" "FXR1" "GJA5"
## [73] "ZNF836" "C16orf48" "DNAJC24" "TSHR"
## [77] "MTA3" "NDST3" "OR12D2" "MKL1"
DF80 <- data.frame(t(acute6monthDF))
colnames(DF80) <- genes80
DF80$class <- classNames
DF80_ML <- DF80[-1,]
write.csv(DF80_ML, 'ML_lyme_DF_86X81.csv', row.names=F)
You can access this file here.
clear environment of all variables.
rm(list=ls())
Read in machine learning data frame of 80 genes as features to predict class of 86 classes. This will prevent the rows from having names.
ML80 <- read.csv('ML_lyme_DF_86X81.csv', header=T, na.strings=c('',' ','na','NA'))
Lets use the cross validation function in randomForest to get important features.
str(ML80)
## 'data.frame': 86 obs. of 81 variables:
## $ NEK6 : num -0.3508 -0.1491 -0.0436 0.5418 -0.4119 ...
## $ HECTD1 : num -0.302 0.521 -1.802 0.547 0.531 ...
## $ PDE4DIP : num -0.5338 -0.111 0.04 -0.0543 0.0395 ...
## $ POU2F2 : num -0.305 -0.258 -0.0195 -0.0769 -0.1503 ...
## $ TMX3 : num -0.0214 0.2596 -0.61 0.234 0.4109 ...
## $ KAT5 : num 0.2 0.434 0.377 0.37 -0.159 ...
## $ PLCXD1 : num -0.5578 -0.5126 -0.334 0.0917 -0.4348 ...
## $ MPDU1 : num -0.31 -0.444 -0.122 -0.09 -0.354 ...
## $ PCSK1N : num -0.0774 -0.0476 0.7991 -0.0582 -0.3024 ...
## $ TCEAL4 : num 0.5637 -0.318 -1.2768 0.282 0.0424 ...
## $ KRT36 : num -0.05664 -0.199899 0.185492 0.000144 -0.155894 ...
## $ HSD11B1 : num -0.0724 0.0489 0.1243 -0.0985 -0.1971 ...
## $ LOC100287314 : num -0.1245 0.1503 -0.1013 0.0786 -0.1708 ...
## $ CHGA : num 0.0631 -0.2733 1.2062 -0.2496 -0.2776 ...
## $ STK38L : num -0.7317 0.0727 -0.3871 0.3623 0.0709 ...
## $ BRD3 : num 0.0203 0.2553 0.126 0.313 0.4716 ...
## $ TMSB4Y : num 0.3061 -0.5535 0.0623 -0.2572 0.2259 ...
## $ GGT7 : num -0.156 0.111 -0.0523 -0.4699 0.2977 ...
## $ RNF217 : num -0.315 1.019 -0.152 -0.205 0.305 ...
## $ ALAS2 : num 0.14 2.008 -0.517 -0.554 -0.762 ...
## $ LILRA6 : num -0.571 -0.804 -1.225 -0.696 -0.374 ...
## $ ADRBK2 : num -0.7166 -0.0239 -0.5431 0.0757 -0.2658 ...
## $ TAS2R3 : num 0.2325 0.1031 -0.036 -0.0213 -0.2399 ...
## $ C1orf51 : num 0.1973 -0.1743 0.2388 0.1867 -0.0457 ...
## $ CFTR : num 0.25 -0.272 0.751 -0.632 -0.32 ...
## $ HS3ST3B1 : num -0.336 0.323 0.194 0.392 0.282 ...
## $ LRRCC1 : num -0.0765 -0.0728 -0.0306 0.1316 0.0904 ...
## $ MPO : num -0.0461 -0.3983 -0.0862 -0.0156 -0.249 ...
## $ LOC653562 : num -0.158 1.295 0.837 -0.192 -0.407 ...
## $ TPBG : num -0.287 0.947 0.276 -0.4 -0.591 ...
## $ SCN11A : num -0.0107 0.0328 1.0342 -0.0841 -0.1941 ...
## $ VSIG8 : num 0.627 -0.198 -0.332 0.423 -0.334 ...
## $ CD2AP : num -0.0392 -0.1189 0.0636 0.1281 0.119 ...
## $ FLJ23867 : num -0.4395 0.4936 -0.0292 -0.3027 0.0265 ...
## $ BEGAIN : num -0.3016 -0.1186 1.0694 -0.0139 -0.2875 ...
## $ MGC34800 : num -0.0912 -0.2683 -0.0734 -0.019 -0.2323 ...
## $ P2RX7 : num -0.699 0.453 -0.534 -0.214 0.141 ...
## $ WDR46 : num 0.0299 -0.1665 -0.2608 -0.1359 -0.1444 ...
## $ GSC : num 0.3248 -0.1677 0.407 0.0255 0.3996 ...
## $ C3orf42 : num 0.0669 -0.1948 0.4071 0.0616 -0.4221 ...
## $ MTMR14 : num -0.0949 -0.0122 -0.1157 0.1285 -0.0997 ...
## $ FBXO48 : num 0.173 -0.098 -0.393 0.239 -0.116 ...
## $ H2AFJ : num -0.633 -0.0892 -0.715 0.1485 0.3033 ...
## $ KRTAP10.7 : num 0.734 0.257 1.111 -0.148 -0.135 ...
## $ SNORA74A : num 0.676 0.139 0.872 0.114 -0.38 ...
## $ RASAL1 : num 0.11742 -0.00149 -0.02482 -0.23536 0.028 ...
## $ ENSG00000229196: num 0.75822 -0.00579 0.93324 -0.02652 0.18702 ...
## $ FBXO9 : num 0.0192 -0.0072 -0.6633 -0.0984 0.025 ...
## $ SOX10 : num 0.0862 -0.1704 -0.1274 -0.0798 -0.3639 ...
## $ C1orf127 : num 0.246 0.336 1.465 -0.343 -0.128 ...
## $ F13A1 : num -0.605 0.36 -0.218 -0.598 -0.223 ...
## $ OR10A3 : num -0.06915 0.05162 0.24496 0.00137 -0.28854 ...
## $ LOC145757 : num -0.0569 -0.04769 0.00873 -0.09158 -0.1248 ...
## $ FAT2 : num 0.1919 -0.3219 0.3047 -0.0593 -0.1934 ...
## $ OOEP : num 0.177491 -0.000566 0.668561 0.199253 -0.173194 ...
## $ EIF2C2 : num 0.4964 0.2373 -0.7133 0.3628 -0.0516 ...
## $ MMP13 : num 0.0434 -0.0802 0.2907 -0.0312 -0.1319 ...
## $ MAP2K3 : num -0.7315 0.0363 0.1161 -0.0728 -0.1372 ...
## $ BRSK1 : num -0.604 0.0353 0.3551 -0.164 -0.3456 ...
## $ RHOBTB2 : num 0.1453 0.1134 0.6744 0.1848 0.0606 ...
## $ Septin.6 : num -0.1437 -0.3411 -0.0162 -0.4229 -0.0351 ...
## $ STRA8 : num 0.8642 0.0254 0.3683 0.3999 -0.0766 ...
## $ ATP2C1 : num -0.361 0.125 -0.659 0.119 0.149 ...
## $ OR14A16 : num 0.0286 -0.0641 0.4736 0.0261 -0.3375 ...
## $ AHSG : num 0.192 -0.22 0.92 -0.15 -0.277 ...
## $ ORC6L : num 0.19 -1.369 -0.131 -0.742 0.143 ...
## $ TLX1NB : num 0.502 -0.256 0.807 -0.475 -0.145 ...
## $ MTHFD2 : num 0.35248 0.31067 -1.05314 -0.01933 0.00216 ...
## $ GMPPB : num 0.0119 0.3596 0.5872 0.3397 0.0886 ...
## $ RGPD3 : num -0.000246 0.262285 -0.040798 -0.375937 0.190607 ...
## $ FXR1 : num 0.012 0.4993 -0.3251 0.0827 0.3152 ...
## $ GJA5 : num 0.3345 0.1924 0.4793 -0.0485 -0.1555 ...
## $ ZNF836 : num 0.0683 -0.0257 0.1888 -0.057 -0.1004 ...
## $ C16orf48 : num 0.2156 -0.0478 0.4009 0.3981 -0.21 ...
## $ DNAJC24 : num -0.4986 0.0249 -0.5918 0.3067 -0.0755 ...
## $ TSHR : num -0.056 0.1009 0.3179 0.0243 -0.0942 ...
## $ MTA3 : num 0.239 -0.491 -0.483 0.221 0.871 ...
## $ NDST3 : num -0.0178 0.0633 0.1542 -0.064 -0.0408 ...
## $ OR12D2 : num 0.2254 -0.1449 0.066 0.135 -0.0408 ...
## $ MKL1 : num -0.33057 0.12237 0.00884 0.07873 -0.08919 ...
## $ class : chr "healthy" "healthy" "healthy" "healthy" ...
ML80$class <- as.factor(ML80$class)
str(ML80)
## 'data.frame': 86 obs. of 81 variables:
## $ NEK6 : num -0.3508 -0.1491 -0.0436 0.5418 -0.4119 ...
## $ HECTD1 : num -0.302 0.521 -1.802 0.547 0.531 ...
## $ PDE4DIP : num -0.5338 -0.111 0.04 -0.0543 0.0395 ...
## $ POU2F2 : num -0.305 -0.258 -0.0195 -0.0769 -0.1503 ...
## $ TMX3 : num -0.0214 0.2596 -0.61 0.234 0.4109 ...
## $ KAT5 : num 0.2 0.434 0.377 0.37 -0.159 ...
## $ PLCXD1 : num -0.5578 -0.5126 -0.334 0.0917 -0.4348 ...
## $ MPDU1 : num -0.31 -0.444 -0.122 -0.09 -0.354 ...
## $ PCSK1N : num -0.0774 -0.0476 0.7991 -0.0582 -0.3024 ...
## $ TCEAL4 : num 0.5637 -0.318 -1.2768 0.282 0.0424 ...
## $ KRT36 : num -0.05664 -0.199899 0.185492 0.000144 -0.155894 ...
## $ HSD11B1 : num -0.0724 0.0489 0.1243 -0.0985 -0.1971 ...
## $ LOC100287314 : num -0.1245 0.1503 -0.1013 0.0786 -0.1708 ...
## $ CHGA : num 0.0631 -0.2733 1.2062 -0.2496 -0.2776 ...
## $ STK38L : num -0.7317 0.0727 -0.3871 0.3623 0.0709 ...
## $ BRD3 : num 0.0203 0.2553 0.126 0.313 0.4716 ...
## $ TMSB4Y : num 0.3061 -0.5535 0.0623 -0.2572 0.2259 ...
## $ GGT7 : num -0.156 0.111 -0.0523 -0.4699 0.2977 ...
## $ RNF217 : num -0.315 1.019 -0.152 -0.205 0.305 ...
## $ ALAS2 : num 0.14 2.008 -0.517 -0.554 -0.762 ...
## $ LILRA6 : num -0.571 -0.804 -1.225 -0.696 -0.374 ...
## $ ADRBK2 : num -0.7166 -0.0239 -0.5431 0.0757 -0.2658 ...
## $ TAS2R3 : num 0.2325 0.1031 -0.036 -0.0213 -0.2399 ...
## $ C1orf51 : num 0.1973 -0.1743 0.2388 0.1867 -0.0457 ...
## $ CFTR : num 0.25 -0.272 0.751 -0.632 -0.32 ...
## $ HS3ST3B1 : num -0.336 0.323 0.194 0.392 0.282 ...
## $ LRRCC1 : num -0.0765 -0.0728 -0.0306 0.1316 0.0904 ...
## $ MPO : num -0.0461 -0.3983 -0.0862 -0.0156 -0.249 ...
## $ LOC653562 : num -0.158 1.295 0.837 -0.192 -0.407 ...
## $ TPBG : num -0.287 0.947 0.276 -0.4 -0.591 ...
## $ SCN11A : num -0.0107 0.0328 1.0342 -0.0841 -0.1941 ...
## $ VSIG8 : num 0.627 -0.198 -0.332 0.423 -0.334 ...
## $ CD2AP : num -0.0392 -0.1189 0.0636 0.1281 0.119 ...
## $ FLJ23867 : num -0.4395 0.4936 -0.0292 -0.3027 0.0265 ...
## $ BEGAIN : num -0.3016 -0.1186 1.0694 -0.0139 -0.2875 ...
## $ MGC34800 : num -0.0912 -0.2683 -0.0734 -0.019 -0.2323 ...
## $ P2RX7 : num -0.699 0.453 -0.534 -0.214 0.141 ...
## $ WDR46 : num 0.0299 -0.1665 -0.2608 -0.1359 -0.1444 ...
## $ GSC : num 0.3248 -0.1677 0.407 0.0255 0.3996 ...
## $ C3orf42 : num 0.0669 -0.1948 0.4071 0.0616 -0.4221 ...
## $ MTMR14 : num -0.0949 -0.0122 -0.1157 0.1285 -0.0997 ...
## $ FBXO48 : num 0.173 -0.098 -0.393 0.239 -0.116 ...
## $ H2AFJ : num -0.633 -0.0892 -0.715 0.1485 0.3033 ...
## $ KRTAP10.7 : num 0.734 0.257 1.111 -0.148 -0.135 ...
## $ SNORA74A : num 0.676 0.139 0.872 0.114 -0.38 ...
## $ RASAL1 : num 0.11742 -0.00149 -0.02482 -0.23536 0.028 ...
## $ ENSG00000229196: num 0.75822 -0.00579 0.93324 -0.02652 0.18702 ...
## $ FBXO9 : num 0.0192 -0.0072 -0.6633 -0.0984 0.025 ...
## $ SOX10 : num 0.0862 -0.1704 -0.1274 -0.0798 -0.3639 ...
## $ C1orf127 : num 0.246 0.336 1.465 -0.343 -0.128 ...
## $ F13A1 : num -0.605 0.36 -0.218 -0.598 -0.223 ...
## $ OR10A3 : num -0.06915 0.05162 0.24496 0.00137 -0.28854 ...
## $ LOC145757 : num -0.0569 -0.04769 0.00873 -0.09158 -0.1248 ...
## $ FAT2 : num 0.1919 -0.3219 0.3047 -0.0593 -0.1934 ...
## $ OOEP : num 0.177491 -0.000566 0.668561 0.199253 -0.173194 ...
## $ EIF2C2 : num 0.4964 0.2373 -0.7133 0.3628 -0.0516 ...
## $ MMP13 : num 0.0434 -0.0802 0.2907 -0.0312 -0.1319 ...
## $ MAP2K3 : num -0.7315 0.0363 0.1161 -0.0728 -0.1372 ...
## $ BRSK1 : num -0.604 0.0353 0.3551 -0.164 -0.3456 ...
## $ RHOBTB2 : num 0.1453 0.1134 0.6744 0.1848 0.0606 ...
## $ Septin.6 : num -0.1437 -0.3411 -0.0162 -0.4229 -0.0351 ...
## $ STRA8 : num 0.8642 0.0254 0.3683 0.3999 -0.0766 ...
## $ ATP2C1 : num -0.361 0.125 -0.659 0.119 0.149 ...
## $ OR14A16 : num 0.0286 -0.0641 0.4736 0.0261 -0.3375 ...
## $ AHSG : num 0.192 -0.22 0.92 -0.15 -0.277 ...
## $ ORC6L : num 0.19 -1.369 -0.131 -0.742 0.143 ...
## $ TLX1NB : num 0.502 -0.256 0.807 -0.475 -0.145 ...
## $ MTHFD2 : num 0.35248 0.31067 -1.05314 -0.01933 0.00216 ...
## $ GMPPB : num 0.0119 0.3596 0.5872 0.3397 0.0886 ...
## $ RGPD3 : num -0.000246 0.262285 -0.040798 -0.375937 0.190607 ...
## $ FXR1 : num 0.012 0.4993 -0.3251 0.0827 0.3152 ...
## $ GJA5 : num 0.3345 0.1924 0.4793 -0.0485 -0.1555 ...
## $ ZNF836 : num 0.0683 -0.0257 0.1888 -0.057 -0.1004 ...
## $ C16orf48 : num 0.2156 -0.0478 0.4009 0.3981 -0.21 ...
## $ DNAJC24 : num -0.4986 0.0249 -0.5918 0.3067 -0.0755 ...
## $ TSHR : num -0.056 0.1009 0.3179 0.0243 -0.0942 ...
## $ MTA3 : num 0.239 -0.491 -0.483 0.221 0.871 ...
## $ NDST3 : num -0.0178 0.0633 0.1542 -0.064 -0.0408 ...
## $ OR12D2 : num 0.2254 -0.1449 0.066 0.135 -0.0408 ...
## $ MKL1 : num -0.33057 0.12237 0.00884 0.07873 -0.08919 ...
## $ class : Factor w/ 4 levels "1 month","6 month",..: 4 4 4 4 4 4 4 4 4 4 ...
set.seed(123)
intrain <- sample(1:86,.8*86)
training <- ML80[intrain,]
testing <- ML80[-intrain,]
result <- rfcv(training[,-81], training[,81], cv.fold=3)
print(result)
## $n.var
## [1] 80 40 20 10 5 2 1
##
## $error.cv
## 80 40 20 10 5 2 1
## 0.5294118 0.5000000 0.4852941 0.5294118 0.5882353 0.7058824 0.6470588
##
## $predicted
## $predicted$`80`
## [1] acute acute acute 6 month 1 month acute 1 month 1 month healthy
## [10] 1 month acute acute healthy healthy acute healthy 1 month acute
## [19] acute 1 month 1 month acute 1 month acute acute acute acute
## [28] healthy acute healthy healthy acute acute healthy acute acute
## [37] acute acute acute acute healthy healthy 1 month healthy 1 month
## [46] 1 month acute healthy healthy healthy 1 month acute healthy 1 month
## [55] healthy 1 month 1 month acute acute acute acute healthy 6 month
## [64] 1 month acute healthy 1 month acute
## Levels: 1 month 6 month acute healthy
##
## $predicted$`40`
## [1] acute acute acute 1 month 1 month acute 1 month acute healthy
## [10] 1 month acute acute healthy healthy acute healthy healthy acute
## [19] acute 1 month healthy acute 1 month acute 1 month acute acute
## [28] healthy acute healthy healthy acute 1 month healthy acute acute
## [37] healthy acute acute acute healthy healthy 1 month healthy healthy
## [46] 1 month acute healthy healthy healthy healthy acute healthy 1 month
## [55] 1 month 6 month 1 month acute acute acute acute healthy 6 month
## [64] 1 month acute healthy 1 month acute
## Levels: 1 month 6 month acute healthy
##
## $predicted$`20`
## [1] acute acute acute healthy 1 month acute 1 month acute healthy
## [10] 1 month acute acute healthy 6 month acute healthy healthy 1 month
## [19] acute healthy healthy acute 1 month acute healthy healthy acute
## [28] healthy acute healthy healthy acute 1 month healthy acute acute
## [37] 1 month acute acute acute healthy healthy 1 month healthy healthy
## [46] 1 month acute healthy healthy healthy healthy acute healthy 1 month
## [55] 1 month acute 1 month acute acute acute healthy healthy acute
## [64] 1 month acute healthy 1 month acute
## Levels: 1 month 6 month acute healthy
##
## $predicted$`10`
## [1] acute acute 1 month 6 month 1 month acute 1 month acute healthy
## [10] 1 month acute acute healthy 6 month 1 month healthy healthy 1 month
## [19] acute healthy healthy acute 1 month acute healthy 6 month acute
## [28] healthy acute 1 month healthy acute acute healthy acute acute
## [37] 1 month acute acute acute healthy healthy acute healthy healthy
## [46] 1 month acute 6 month healthy healthy healthy acute healthy 1 month
## [55] 1 month acute 1 month acute healthy acute healthy healthy healthy
## [64] 1 month acute healthy 1 month acute
## Levels: 1 month 6 month acute healthy
##
## $predicted$`5`
## [1] acute acute 1 month healthy 1 month acute 1 month 1 month healthy
## [10] 1 month acute acute healthy healthy acute acute healthy 1 month
## [19] acute healthy healthy acute 1 month acute healthy 6 month acute
## [28] 1 month 1 month acute healthy acute 1 month healthy acute acute
## [37] acute 1 month acute acute healthy healthy 6 month healthy 1 month
## [46] 1 month acute healthy 1 month 6 month healthy acute healthy 1 month
## [55] 1 month acute 1 month acute 6 month acute healthy 6 month 6 month
## [64] 1 month acute healthy 1 month acute
## Levels: 1 month 6 month acute healthy
##
## $predicted$`2`
## [1] healthy acute 6 month healthy 1 month acute 1 month 6 month 1 month
## [10] acute acute acute 1 month healthy 1 month healthy 1 month 6 month
## [19] 6 month acute 1 month acute healthy healthy healthy 6 month acute
## [28] 1 month healthy healthy 1 month acute healthy healthy acute acute
## [37] acute acute acute acute healthy 6 month healthy acute 6 month
## [46] 1 month acute healthy 1 month acute acute acute 1 month 1 month
## [55] 6 month acute acute acute healthy acute healthy acute acute
## [64] healthy acute healthy 1 month 1 month
## Levels: 1 month 6 month acute healthy
##
## $predicted$`1`
## [1] 6 month healthy 6 month healthy 1 month acute healthy 6 month 6 month
## [10] acute acute acute 1 month healthy healthy healthy 1 month acute
## [19] acute acute 6 month acute 1 month 6 month 1 month 1 month healthy
## [28] healthy healthy healthy healthy acute 1 month healthy acute acute
## [37] healthy acute acute 1 month healthy healthy healthy acute 6 month
## [46] healthy acute healthy healthy acute healthy acute 1 month 6 month
## [55] 6 month 1 month healthy acute healthy acute healthy 1 month acute
## [64] healthy acute acute 6 month acute
## Levels: 1 month 6 month acute healthy
bestPredicted <- data.frame(predicted=result$predicted[3],type=training$class)
bestPredicted
## X20 type
## 1 acute acute
## 2 acute 6 month
## 3 acute 1 month
## 4 healthy healthy
## 5 1 month 1 month
## 6 acute acute
## 7 1 month 1 month
## 8 acute acute
## 9 healthy 6 month
## 10 1 month acute
## 11 acute 1 month
## 12 acute 1 month
## 13 healthy healthy
## 14 6 month 1 month
## 15 acute acute
## 16 healthy healthy
## 17 healthy 6 month
## 18 1 month 1 month
## 19 acute acute
## 20 healthy 6 month
## 21 healthy healthy
## 22 acute acute
## 23 1 month 1 month
## 24 acute acute
## 25 healthy 1 month
## 26 healthy 1 month
## 27 acute acute
## 28 healthy healthy
## 29 acute acute
## 30 healthy acute
## 31 healthy 1 month
## 32 acute 1 month
## 33 1 month 1 month
## 34 healthy 1 month
## 35 acute 1 month
## 36 acute acute
## 37 1 month 6 month
## 38 acute acute
## 39 acute acute
## 40 acute healthy
## 41 healthy healthy
## 42 healthy healthy
## 43 1 month healthy
## 44 healthy healthy
## 45 healthy acute
## 46 1 month 1 month
## 47 acute 6 month
## 48 healthy 6 month
## 49 healthy healthy
## 50 healthy 1 month
## 51 healthy 1 month
## 52 acute 6 month
## 53 healthy healthy
## 54 1 month acute
## 55 1 month healthy
## 56 acute healthy
## 57 1 month acute
## 58 acute acute
## 59 acute acute
## 60 acute 1 month
## 61 healthy acute
## 62 healthy healthy
## 63 acute acute
## 64 1 month 6 month
## 65 acute acute
## 66 healthy healthy
## 67 1 month healthy
## 68 acute acute
sum(bestPredicted$X20==bestPredicted$type)/length(bestPredicted$X20)
## [1] 0.5147059
This used 20 features and got the least error but results in accuracy are terrible. Now the actual random forest function in the package.
set.seed(123)
ML80.rf <- randomForest(class ~ ., data=training, ntree=10000, keep.forest=TRUE,
importance=TRUE)
print(ML80.rf)
##
## Call:
## randomForest(formula = class ~ ., data = training, ntree = 10000, keep.forest = TRUE, importance = TRUE)
## Type of random forest: classification
## Number of trees: 10000
## No. of variables tried at each split: 8
##
## OOB estimate of error rate: 55.88%
## Confusion matrix:
## 1 month 6 month acute healthy class.error
## 1 month 4 1 9 5 0.7894737
## 6 month 2 0 2 5 1.0000000
## acute 5 1 16 1 0.3043478
## healthy 3 2 2 10 0.4117647
The predictors are best at predicting the acute or healthy cases but not the 6 month at all. These were the top 80
variableImportance <- data.frame(ML80.rf$importance)
variableImportance
## X1.month X6.month acute healthy
## NEK6 1.974351e-04 -3.292857e-04 8.552170e-05 -2.851263e-04
## HECTD1 -1.284888e-03 7.248810e-04 4.198399e-05 2.918687e-04
## PDE4DIP 2.285871e-03 -1.536429e-03 1.265333e-02 9.851602e-03
## POU2F2 -5.980098e-04 -6.357143e-04 1.916839e-04 -9.329834e-04
## TMX3 4.191026e-04 1.692619e-03 1.882240e-04 -4.310281e-04
## KAT5 1.726551e-04 -1.708333e-03 -6.633264e-04 -1.354264e-03
## PLCXD1 -2.525278e-04 4.195119e-03 6.713683e-03 1.426802e-02
## MPDU1 -1.024509e-03 5.993571e-03 2.997404e-03 5.952554e-03
## PCSK1N 5.730988e-04 -1.069167e-03 3.408525e-04 9.876154e-04
## TCEAL4 2.320814e-03 9.238095e-04 2.483950e-03 -3.909632e-04
## KRT36 -8.934488e-04 -1.036667e-03 4.749023e-04 4.627706e-05
## HSD11B1 -6.795768e-04 -9.776190e-04 -1.996923e-04 -5.026154e-04
## LOC100287314 8.642424e-04 -1.985238e-03 3.275191e-04 9.191198e-04
## CHGA -1.352977e-03 7.965714e-03 -3.401071e-05 -8.655700e-04
## STK38L -8.818975e-04 -1.061667e-03 -4.119408e-04 -3.406746e-04
## BRD3 -7.213731e-04 -7.416667e-05 -5.559645e-04 2.725476e-03
## TMSB4Y -9.037410e-04 5.086667e-03 5.909507e-05 3.516609e-03
## GGT7 -9.113215e-04 -1.216667e-03 -5.410847e-04 -2.234416e-04
## RNF217 4.488864e-04 -2.040000e-03 -7.838764e-04 7.766703e-04
## ALAS2 -1.188402e-03 -6.626190e-04 1.023624e-03 -6.522319e-04
## LILRA6 2.839207e-03 -1.080000e-03 3.377771e-02 1.424508e-02
## ADRBK2 8.962796e-04 -2.159286e-03 2.883323e-03 -1.064711e-03
## TAS2R3 -1.385606e-04 1.522381e-03 -5.549728e-04 -9.244372e-04
## C1orf51 -8.283270e-04 -7.583333e-04 -6.415321e-04 -9.344877e-04
## CFTR -4.741234e-04 -7.033333e-04 -5.652926e-04 -1.250794e-04
## HS3ST3B1 -5.901623e-04 3.150000e-04 1.663434e-04 3.888889e-05
## LRRCC1 -5.834443e-06 -8.833333e-04 -4.550325e-04 1.394369e-03
## MPO -3.530254e-03 -2.317262e-03 6.070642e-03 6.943828e-03
## LOC653562 -9.442713e-04 3.359524e-04 -1.212044e-04 -2.877020e-04
## TPBG -5.254365e-04 -1.966667e-04 -1.015548e-03 -4.573413e-04
## SCN11A -5.679762e-04 6.366667e-04 -3.093726e-04 3.642821e-04
## VSIG8 -2.573088e-04 -6.833333e-05 -4.862388e-04 -4.927350e-05
## CD2AP -5.925214e-05 -1.200952e-03 -6.194797e-04 3.993290e-04
## FLJ23867 9.941295e-04 -2.634286e-03 6.876315e-04 9.727237e-03
## BEGAIN 2.046520e-03 1.032619e-03 -6.999170e-04 -1.260325e-03
## MGC34800 1.626015e-03 -9.261905e-05 -4.369722e-04 -8.397294e-04
## P2RX7 -2.874459e-05 1.012619e-03 8.740327e-03 1.433532e-03
## WDR46 -1.815771e-03 -1.620952e-03 1.600214e-03 5.361223e-03
## GSC -3.374747e-04 1.760952e-03 1.772078e-04 3.808730e-04
## C3orf42 -5.017205e-05 -3.445238e-04 -4.190820e-04 -4.230483e-04
## MTMR14 -6.035642e-04 2.114762e-03 2.300362e-03 -1.238045e-03
## FBXO48 -6.107157e-04 2.997619e-04 -3.655375e-04 1.099531e-04
## H2AFJ 7.025000e-04 3.464286e-04 2.823888e-03 4.744148e-03
## KRTAP10.7 -7.190249e-04 3.585000e-03 -5.933081e-04 -7.578860e-04
## SNORA74A 6.677861e-04 -2.249405e-03 8.034345e-03 1.314221e-03
## RASAL1 -8.204906e-05 -3.602381e-04 -3.286880e-04 -5.424603e-04
## ENSG00000229196 -1.455999e-03 -2.876190e-04 -2.684527e-04 -2.683983e-04
## FBXO9 -7.120815e-04 5.738095e-05 4.890390e-04 3.212845e-03
## SOX10 -1.261472e-04 -1.383333e-04 -3.349062e-04 -5.091270e-05
## C1orf127 -3.785354e-05 -2.035714e-04 -3.317757e-04 -2.548088e-04
## F13A1 -2.749952e-03 5.108333e-03 7.095081e-03 -1.239899e-04
## OR10A3 -8.127500e-04 -5.433333e-04 -4.744986e-04 -3.907143e-04
## LOC145757 5.150627e-04 -2.045000e-03 1.842774e-04 4.520613e-03
## FAT2 -5.146861e-04 1.788333e-03 -2.044134e-04 7.615909e-04
## OOEP -1.820346e-05 -8.666667e-04 -1.365008e-03 1.151263e-04
## EIF2C2 -5.535137e-04 4.002381e-04 2.278918e-04 7.535354e-04
## MMP13 -1.906566e-04 7.116667e-04 -4.410728e-05 -5.623918e-04
## MAP2K3 -1.434003e-03 7.590476e-04 1.898859e-03 8.121825e-04
## BRSK1 -1.404646e-03 2.861667e-03 1.299509e-02 1.171913e-02
## RHOBTB2 -7.349939e-04 -1.182619e-03 6.101537e-04 9.940476e-05
## Septin.6 1.609333e-03 1.366667e-04 -2.306087e-04 1.805234e-03
## STRA8 2.506507e-04 -7.066667e-04 -1.186548e-03 -6.760714e-04
## ATP2C1 -1.399531e-04 -1.095952e-03 -4.790132e-04 8.758297e-05
## OR14A16 -7.488614e-04 3.720000e-03 -5.020256e-04 -2.622583e-04
## AHSG 7.165260e-04 1.514762e-03 -1.113413e-03 4.628499e-04
## ORC6L -1.011624e-04 6.109524e-04 7.262432e-04 3.013564e-03
## TLX1NB -4.712698e-04 -7.566667e-04 -7.107035e-04 5.088492e-04
## MTHFD2 8.634740e-04 4.579762e-04 2.003276e-03 9.091270e-05
## GMPPB -7.762060e-04 3.300000e-04 4.077717e-04 -2.629113e-04
## RGPD3 -5.790801e-04 -1.823810e-04 -7.207035e-04 -2.659993e-04
## FXR1 -1.271634e-03 -6.480952e-04 1.074139e-03 -3.271104e-04
## GJA5 7.243193e-04 3.497619e-04 2.181167e-03 -1.640519e-03
## ZNF836 4.431843e-04 -1.108571e-03 -3.985792e-05 2.763492e-04
## C16orf48 -8.427128e-05 -7.935714e-04 -2.499215e-04 -2.623810e-04
## DNAJC24 2.164037e-02 -1.300119e-03 2.093149e-03 7.514390e-03
## TSHR -1.945666e-03 -2.695238e-04 3.921679e-03 -1.711876e-03
## MTA3 -6.852886e-04 1.571429e-05 -6.220635e-04 2.546501e-04
## NDST3 2.269300e-04 -1.046667e-03 -8.900794e-05 -6.222522e-04
## OR12D2 -4.028608e-04 -9.033333e-04 -6.600358e-04 -8.977886e-04
## MKL1 1.023160e-04 3.166667e-05 1.324773e-03 1.956349e-05
## MeanDecreaseAccuracy MeanDecreaseGini
## NEK6 2.872128e-05 0.4589293
## HECTD1 -2.174431e-04 0.5157746
## PDE4DIP 6.777219e-03 1.3999604
## POU2F2 -3.595395e-04 0.3954239
## TMX3 3.014424e-04 0.5570351
## KAT5 -6.509862e-04 0.4162122
## PLCXD1 5.905106e-03 1.3618037
## MPDU1 2.763996e-03 0.9856564
## PCSK1N 3.495545e-04 0.5304417
## TCEAL4 1.448099e-03 0.5912317
## KRT36 -1.503262e-04 0.3952479
## HSD11B1 -5.172632e-04 0.3304101
## LOC100287314 3.096906e-04 0.6100534
## CHGA 2.162620e-04 0.6150855
## STK38L -5.331447e-04 0.3985381
## BRD3 3.302302e-04 0.5888511
## TMSB4Y 1.039050e-03 0.7175469
## GGT7 -5.937594e-04 0.3342483
## RNF217 -2.959888e-04 0.4962428
## ALAS2 -2.821824e-04 0.5104183
## LILRA6 1.484886e-02 2.3446198
## ADRBK2 6.533095e-04 0.5355615
## TAS2R3 -2.430337e-04 0.4652468
## C1orf51 -7.318249e-04 0.3552346
## CFTR -4.469058e-04 0.3247389
## HS3ST3B1 -1.625812e-04 0.4923773
## LRRCC1 9.067389e-05 0.4370770
## MPO 2.366932e-03 1.0157770
## LOC653562 -3.804756e-04 0.5558021
## TPBG -6.311377e-04 0.3366068
## SCN11A -4.738956e-05 0.4383964
## VSIG8 -2.714597e-04 0.4250916
## CD2AP -3.363006e-04 0.4615719
## FLJ23867 2.494598e-03 1.0046261
## BEGAIN 9.947838e-05 0.5266534
## MGC34800 1.809359e-04 0.5726299
## P2RX7 3.315751e-03 0.9314206
## WDR46 1.203960e-03 0.7684234
## GSC 1.598551e-04 0.5463876
## C3orf42 -2.386917e-04 0.3506272
## MTMR14 5.728359e-04 0.6699965
## FBXO48 -2.683464e-04 0.3947342
## H2AFJ 1.988862e-03 1.0550720
## KRTAP10.7 -1.371929e-04 0.5625267
## SNORA74A 2.814653e-03 0.9362515
## RASAL1 -2.537410e-04 0.3562299
## ENSG00000229196 -6.112341e-04 0.4269099
## FBXO9 6.654369e-04 0.6961899
## SOX10 -1.025804e-04 0.4788161
## C1orf127 -2.302498e-04 0.4154327
## F13A1 2.110151e-03 1.0254595
## OR10A3 -4.889625e-04 0.4279342
## LOC145757 9.878939e-04 0.6239428
## FAT2 1.304458e-04 0.5913211
## OOEP -4.986156e-04 0.4146760
## EIF2C2 2.412391e-04 0.5792137
## MMP13 -1.667566e-04 0.4245617
## MAP2K3 5.174084e-04 0.6661249
## BRSK1 6.941550e-03 1.1518720
## RHOBTB2 -8.820358e-05 0.4956573
## Septin.6 8.963937e-04 0.6698110
## STRA8 -5.344679e-04 0.4113137
## ATP2C1 -3.376182e-04 0.4450835
## OR14A16 -1.157674e-04 0.4908333
## AHSG 3.693337e-05 0.4334393
## ORC6L 1.043335e-03 0.8923605
## TLX1NB -3.270081e-04 0.3943080
## MTHFD2 1.029491e-03 0.7150768
## GMPPB -9.926669e-05 0.3748746
## RGPD3 -4.544043e-04 0.4032293
## FXR1 -6.308354e-05 0.5170377
## GJA5 5.400380e-04 0.6246474
## ZNF836 9.541820e-05 0.4513524
## C16orf48 -1.354554e-04 0.3652100
## DNAJC24 7.921279e-03 1.6578556
## TSHR 3.837562e-04 0.7023384
## MTA3 -3.749652e-04 0.4890602
## NDST3 -2.258230e-04 0.3999704
## OR12D2 -7.141836e-04 0.3299746
## MKL1 4.188917e-04 0.4796318
Now order by decreasing for the 4 classes and important features.
acuteTop <- variableImportance[order(variableImportance$acute, decreasing=T),]
acuteTop$Gene <- row.names(acuteTop)
head(acuteTop)
## X1.month X6.month acute healthy
## LILRA6 2.839207e-03 -0.001080000 0.033777711 0.0142450758
## BRSK1 -1.404646e-03 0.002861667 0.012995086 0.0117191270
## PDE4DIP 2.285871e-03 -0.001536429 0.012653335 0.0098516017
## P2RX7 -2.874459e-05 0.001012619 0.008740327 0.0014335317
## SNORA74A 6.677861e-04 -0.002249405 0.008034345 0.0013142208
## F13A1 -2.749952e-03 0.005108333 0.007095081 -0.0001239899
## MeanDecreaseAccuracy MeanDecreaseGini Gene
## LILRA6 0.014848864 2.3446198 LILRA6
## BRSK1 0.006941550 1.1518720 BRSK1
## PDE4DIP 0.006777219 1.3999604 PDE4DIP
## P2RX7 0.003315751 0.9314206 P2RX7
## SNORA74A 0.002814653 0.9362515 SNORA74A
## F13A1 0.002110151 1.0254595 F13A1
The above genes are not the genes we found when only scraping top 20 and bottom 20 fold change genes of acute/healthy, month1/healthy, or month6/healthy. This Rstudio run gave the top 5 genes as LILRA6, BRSK1, PDE4DIP, P2RX7, and SNORA74A. This data is using a different fold change value that was there but not used of the acute infection compared to six months after infection.
predictions <- predict(ML80.rf, newdata=testing)
predictions
## 1 3 19 20 24 35 37 39 47 52
## healthy healthy acute healthy 1 month acute acute acute acute acute
## 58 59 60 61 63 64 68 78
## 1 month 1 month 1 month acute acute 1 month acute acute
## Levels: 1 month 6 month acute healthy
sum(predictions==testing$class)/length(testing$class)
## [1] 0.6111111
set.seed(123)
modelResult <- data.frame(predictions, type=testing$class)
modelResult
## predictions type
## 1 healthy healthy
## 3 healthy healthy
## 19 acute healthy
## 20 healthy healthy
## 24 1 month acute
## 35 acute acute
## 37 acute acute
## 39 acute acute
## 47 acute acute
## 52 acute 1 month
## 58 1 month 1 month
## 59 1 month 1 month
## 60 1 month 1 month
## 61 acute 1 month
## 63 acute 1 month
## 64 1 month 1 month
## 68 acute 1 month
## 78 acute 6 month
This model scored 61% accuracy using fold change values and these parameter settings for random forest in the randomForest package of 80% data trained, 20% tested on, and 10k trees with default parameter settings of randomForest function.
Tune a model to find best mtry to randomForest function. The number of mtry parameter is using how many of our 80 features is best for the one out bag calculations. Parameter settings for doBest set to TRUE is to run the forest model with best mtry value found, the stepFactor is to run by this number, and improve parameter is to keep tuning until you cannot get better than the value you set for improve with each run or iteration.
tunedModel <- tuneRF(training[,-81], training[,81], ntreetry= 10000,stepFactor= 1.2
, doBest=TRUE, improve=.6)
## mtry = 8 OOB error = 58.82%
## Searching left ...
## mtry = 7 OOB error = 55.88%
## 0.05 0.6
## Searching right ...
## mtry = 9 OOB error = 47.06%
## 0.2 0.6
set.seed(123)
ML80.rf.tuned <- randomForest(class ~ ., data=training, mtry=9, ntree=10000, keep.forest=TRUE,
importance=TRUE)
print(ML80.rf.tuned)
##
## Call:
## randomForest(formula = class ~ ., data = training, mtry = 9, ntree = 10000, keep.forest = TRUE, importance = TRUE)
## Type of random forest: classification
## Number of trees: 10000
## No. of variables tried at each split: 9
##
## OOB estimate of error rate: 55.88%
## Confusion matrix:
## 1 month 6 month acute healthy class.error
## 1 month 5 1 9 4 0.7368421
## 6 month 2 0 2 5 1.0000000
## acute 5 1 15 2 0.3478261
## healthy 3 2 2 10 0.4117647
importance(ML80.rf.tuned)
## 1 month 6 month acute healthy
## NEK6 3.17844419 -0.20591611 -0.181587507 0.45657905
## HECTD1 -3.23224966 2.01230032 -0.456387976 3.15973725
## PDE4DIP 5.50282801 -4.96272788 19.532684711 14.73682403
## POU2F2 -0.97868881 -1.66621045 1.232438965 0.10915371
## TMX3 1.77677173 3.40399040 1.330505606 -1.66766137
## KAT5 2.13500878 -0.98935876 -1.064391432 -3.34283616
## PLCXD1 0.05115988 2.06338225 12.085639666 20.01562952
## MPDU1 -1.70015009 10.59636397 5.439229817 12.07007164
## PCSK1N 2.32622886 -4.29942757 0.636711116 2.18952169
## TCEAL4 4.41129695 2.00755649 5.759289050 -1.61974637
## KRT36 -2.41543261 -1.79889318 0.972345937 -1.28134488
## HSD11B1 -2.56394381 -1.15265324 -3.357116862 -0.39810103
## LOC100287314 1.89120092 -3.78265392 1.151139943 2.70494631
## CHGA -3.11532343 8.51944071 0.004472316 -0.86066088
## STK38L -1.36470741 -0.18506387 -0.054555371 -1.32198785
## BRD3 -2.49293421 -0.00936427 0.641002086 5.61718981
## TMSB4Y -2.03867654 9.10506911 -2.646282514 5.23276317
## GGT7 -4.16649318 -2.18515925 0.282770347 -3.22123669
## RNF217 1.09015687 -1.40395981 -1.898577108 2.29843030
## ALAS2 -4.05131872 -1.68428741 0.864569347 -2.61258600
## LILRA6 3.00883618 -1.67575108 34.208248136 20.14631723
## ADRBK2 4.31417460 -3.79667201 7.122397903 -1.40847347
## TAS2R3 -2.19917341 2.77215285 -2.118100737 -2.81210034
## C1orf51 -3.35600084 -1.75780038 -2.046965391 -3.22988227
## CFTR -2.85414406 -0.59670304 -3.688653390 -1.08951558
## HS3ST3B1 -2.33061149 -0.54571099 2.786566431 -0.38925956
## LRRCC1 -0.90377577 -1.89223051 -2.965499292 2.21031475
## MPO -5.98461589 -3.53479480 11.320137675 13.62988082
## LOC653562 -0.25411489 -0.70623610 0.179256951 0.53852482
## TPBG -2.32881961 -1.68535854 -3.487047713 -3.82323852
## SCN11A -0.85873735 2.07357111 -1.192044616 1.94138201
## VSIG8 -3.47982755 -0.26420677 -1.815027755 0.05763974
## CD2AP -2.50591893 -3.68639654 -0.481841054 0.81889122
## FLJ23867 4.42557672 -4.76616737 0.036046521 14.98017631
## BEGAIN 3.85298948 0.15046666 -2.791367131 -1.45165018
## MGC34800 5.22561779 -1.55677007 -0.872611462 -3.22891960
## P2RX7 -0.76096009 2.63268898 16.948178258 3.52474644
## WDR46 -4.50074546 -2.97246049 4.283060482 11.85131167
## GSC -4.22936114 2.32756918 0.771601652 -0.55954593
## C3orf42 -0.93613658 -2.21768865 -1.650824455 -3.90974555
## MTMR14 -2.49888385 2.14302169 6.252696884 -2.57447892
## FBXO48 -2.77187454 -1.96255818 -0.219189489 -1.46693994
## H2AFJ -0.50583933 -0.16171982 5.655937303 9.07740847
## KRTAP10.7 -1.19036069 6.40120355 -2.222688086 -2.77630340
## SNORA74A 3.88576774 -3.69759347 16.149637644 1.09875782
## RASAL1 -2.81177660 0.06366687 -2.933931687 -3.18227380
## ENSG00000229196 -3.96883511 -1.52556777 -2.782718461 0.53452710
## FBXO9 -2.30069809 -1.63171936 0.570024247 6.25378097
## SOX10 1.31806016 -1.35043453 -1.140234952 -2.23496821
## C1orf127 0.55341063 -0.03530793 -2.533537395 0.56839388
## F13A1 -5.88077459 8.30790264 14.119805385 1.50085833
## OR10A3 -0.79076088 -1.32404106 0.576778624 1.68343903
## LOC145757 1.05989949 -4.42575815 -0.114942752 10.14242654
## FAT2 -0.67230521 1.24938167 -1.143370605 1.36840666
## OOEP -2.17898701 -4.14634944 -5.360137041 1.57056264
## EIF2C2 -0.25343857 -0.50234771 -0.278105830 0.73947703
## MMP13 -4.25949125 2.15163178 1.174678886 -3.31696369
## MAP2K3 -3.69766270 1.80880254 4.358339124 0.28895062
## BRSK1 -1.84613588 4.01896130 19.208748573 15.66402314
## RHOBTB2 -3.47418297 -3.00421646 0.611333970 1.37349661
## Septin.6 1.27441228 0.46150232 -0.311437052 6.62585412
## STRA8 -1.82117373 -1.17639253 -2.184584069 -1.27988366
## ATP2C1 -1.05430677 -3.29612966 -1.276168144 2.26361634
## OR14A16 -0.85530413 5.51835041 -1.521453427 -1.58313532
## AHSG -0.09273714 4.28090409 -3.507839512 -0.75226232
## ORC6L -0.43637922 0.19477909 1.758969828 5.69571078
## TLX1NB -0.29279018 -2.58847157 -2.643178648 0.15947685
## MTHFD2 3.35088428 0.44182961 2.720611326 3.53590199
## GMPPB -1.63755788 0.68188114 0.148882673 -0.75541725
## RGPD3 -3.14600483 -2.77011872 -2.380442814 -1.96091162
## FXR1 -2.61676762 -3.14271707 3.434424820 1.55321105
## GJA5 1.37297993 -0.08861429 4.572555961 -3.83214182
## ZNF836 -1.04521754 -5.15611373 -2.212134657 -0.31807492
## C16orf48 -1.80484440 -2.21415806 -0.994680410 -2.28364894
## DNAJC24 26.37460352 -0.16654904 3.621126222 12.44330299
## TSHR -5.76097167 0.01032284 12.562694740 -3.43890403
## MTA3 -2.00615513 -1.86973795 -1.470729555 -0.37929565
## NDST3 0.36789166 -1.73556731 -0.987385495 -0.93164578
## OR12D2 -1.13819608 -2.50780767 -4.146718290 -3.47672020
## MKL1 0.54018706 -0.78513365 4.864803360 -2.05855060
## MeanDecreaseAccuracy MeanDecreaseGini
## NEK6 1.16823440 0.4351816
## HECTD1 0.47489509 0.4673643
## PDE4DIP 20.22962331 1.5013514
## POU2F2 -0.73716890 0.3741187
## TMX3 2.18030576 0.6238894
## KAT5 -1.37198197 0.4208338
## PLCXD1 18.14244752 1.4557262
## MPDU1 10.58400672 1.0063831
## PCSK1N 1.29183550 0.5486337
## TCEAL4 5.32258633 0.6060380
## KRT36 -1.97340893 0.3672726
## HSD11B1 -3.52434651 0.3137500
## LOC100287314 1.44617277 0.5906007
## CHGA 1.75993366 0.6349335
## STK38L -1.31905098 0.3908343
## BRD3 2.25013512 0.5905398
## TMSB4Y 3.61988139 0.7394827
## GGT7 -4.38632089 0.3097573
## RNF217 -0.01196054 0.4795363
## ALAS2 -2.94598963 0.4973330
## LILRA6 32.94076878 2.4764896
## ADRBK2 6.01364213 0.5134961
## TAS2R3 -2.58852148 0.4461379
## C1orf51 -5.25365014 0.3395588
## CFTR -4.56412097 0.3321699
## HS3ST3B1 -0.60373500 0.4847396
## LRRCC1 -2.04360330 0.4147686
## MPO 9.48435629 1.0529854
## LOC653562 0.03313130 0.5408874
## TPBG -5.32252139 0.3198409
## SCN11A 0.50478459 0.4078703
## VSIG8 -2.96955571 0.4326255
## CD2AP -2.90556360 0.4619491
## FLJ23867 9.29083843 0.9600182
## BEGAIN 0.06751539 0.5347422
## MGC34800 0.80578677 0.5620145
## P2RX7 13.88018035 0.9448540
## WDR46 6.18230764 0.7482990
## GSC -1.33267261 0.5633950
## C3orf42 -3.70206883 0.3191279
## MTMR14 1.86662049 0.6750362
## FBXO48 -2.91835665 0.3779377
## H2AFJ 7.56750363 1.0969368
## KRTAP10.7 -1.31744201 0.5456194
## SNORA74A 11.27666348 0.8980419
## RASAL1 -4.88741602 0.3703968
## ENSG00000229196 -3.76988346 0.4109683
## FBXO9 2.46258369 0.6969781
## SOX10 -1.21349086 0.4607025
## C1orf127 -0.94672243 0.4210986
## F13A1 9.71149499 1.0620147
## OR10A3 0.54963710 0.4284370
## LOC145757 4.71883266 0.6176924
## FAT2 -0.05211871 0.5636897
## OOEP -4.73836558 0.3935767
## EIF2C2 0.09635417 0.6009054
## MMP13 -2.06842161 0.4133290
## MAP2K3 1.04416808 0.6763739
## BRSK1 19.87439214 1.1816095
## RHOBTB2 -1.78224177 0.4718620
## Septin.6 4.12097645 0.6607393
## STRA8 -3.50343517 0.4091857
## ATP2C1 -1.54121365 0.4234493
## OR14A16 -0.15282018 0.5077850
## AHSG -1.16233381 0.4118094
## ORC6L 4.02713563 0.9164653
## TLX1NB -2.23597922 0.3893407
## MTHFD2 5.32288040 0.7195708
## GMPPB -0.81274239 0.3901742
## RGPD3 -4.53826166 0.4156880
## FXR1 1.28468334 0.4801976
## GJA5 1.93215376 0.5839072
## ZNF836 -3.52868805 0.4453706
## C16orf48 -3.51418085 0.3405220
## DNAJC24 24.51004801 1.7768431
## TSHR 4.01876485 0.6547802
## MTA3 -2.19073845 0.4818738
## NDST3 -1.63901814 0.3787438
## OR12D2 -5.69861114 0.3140389
## MKL1 1.65277453 0.4683902
predictions1 <- predict(ML80.rf.tuned, newdata=testing)
predictions1
## 1 3 19 20 24 35 37 39 47 52
## healthy healthy 1 month healthy 1 month acute acute acute acute acute
## 58 59 60 61 63 64 68 78
## 1 month 1 month 1 month acute acute 1 month acute acute
## Levels: 1 month 6 month acute healthy
prediction1_df <- data.frame(predictions1,type=testing$class)
prediction1_df
## predictions1 type
## 1 healthy healthy
## 3 healthy healthy
## 19 1 month healthy
## 20 healthy healthy
## 24 1 month acute
## 35 acute acute
## 37 acute acute
## 39 acute acute
## 47 acute acute
## 52 acute 1 month
## 58 1 month 1 month
## 59 1 month 1 month
## 60 1 month 1 month
## 61 acute 1 month
## 63 acute 1 month
## 64 1 month 1 month
## 68 acute 1 month
## 78 acute 6 month
sum(prediction1_df$predictions1==prediction1_df$type)/length(prediction1_df$predictions1)
## [1] 0.6111111
61% accurate
After tuning the model, the results were the same as the first run with default parameters of 61% accuracy in prediction. Probably because the data was limited in class type to begin with. So we can just try the fold change values to see if only keeping the acute and chronic cases would give better results. Two classes should have better odds than four classes, but again there are the least amount of chronic classes and our model inaccurately predicted any of the chronic cases even though these are the top gene expression genes in fold change values of acute to chronic infection.
Lets just see the imbalance again.
ML80 %>% group_by(class) %>% count(class)
## # A tibble: 4 × 2
## # Groups: class [4]
## class n
## <fct> <int>
## 1 1 month 27
## 2 6 month 10
## 3 acute 28
## 4 healthy 21
So, we are to remove the 1 month class of 27 samples and the healthy 21 samples. We are left with 10 chronic samples and 28 acute samples.
ML80_onlyAcute6month <- subset(ML80, ML80$class=="acute" | ML80$class=='6 month')
ML80_onlyAcute6month$class
## [1] acute acute acute acute acute acute acute acute acute
## [10] acute acute acute acute acute acute acute acute acute
## [19] acute acute acute acute acute acute acute acute acute
## [28] acute 6 month 6 month 6 month 6 month 6 month 6 month 6 month 6 month
## [37] 6 month 6 month
## Levels: 1 month 6 month acute healthy
ML80_onlyAcute6month %>% group_by(class) %>% count(class)
## # A tibble: 2 × 2
## # Groups: class [2]
## class n
## <fct> <int>
## 1 6 month 10
## 2 acute 28
There are 10 samples of 6 month and 28 of acute. Lets get rolling with our training and testing sets and then the randomForest model tuning and building.
set.seed(123)
intrain <- sample(1:38, .8*38)
training <- ML80_onlyAcute6month[intrain,]
testing <- ML80_onlyAcute6month[-intrain,]
training %>% group_by(class) %>% count(class)
## # A tibble: 2 × 2
## # Groups: class [2]
## class n
## <fct> <int>
## 1 6 month 9
## 2 acute 21
The model is building a classifier off 9 observations of chronic cases and 21 acute cases.
testing %>% group_by(class) %>% count(class)
## # A tibble: 2 × 2
## # Groups: class [2]
## class n
## <fct> <int>
## 1 6 month 1
## 2 acute 7
The testing class will have 1 class of chronic and 7 cases of acute to predict accurately.
Lets see how this goes. Lets tune first for best mtry to use then use the randomForest function. run or iteration.
# tunedModel1 <- tuneRF(training[,1:80], training[,-81], ntreetry= 10000,stepFactor= .5
# , doBest=TRUE, improve=.5)
“Error in randomForest.default(x, y, mtry = mtryStart, ntree = ntreeTry, : length of response must be the same as predictors”
The above tuning won’t work, must be because of the limitation on classes in the target only having 1 chronic case compared to 7. So we move on to the random forest modeling.
# set.seed(123)
# ML80.rf.tuned <- randomForest(class ~ ., data=training, mtry=9, ntree=10000, keep.forest=TRUE,
# importance=TRUE)
# print(ML80.rf.tuned)
“Error in randomForest.default(m, y, …) : Can’t have empty classes in y.” Maybe a problem with factorization as the str() will probably give some information, should only have 2 classes but showed 4.
str(ML80_onlyAcute6month$class)
## Factor w/ 4 levels "1 month","6 month",..: 3 3 3 3 3 3 3 3 3 3 ...
So we have a problem, there are only 2 classes from our removal of two and yet the string attaches those factors. Lets try changing it to character and back to factor.
ML80_onlyAcute6month$class <- as.character(ML80_onlyAcute6month$class)
str(ML80_onlyAcute6month$class)
## chr [1:38] "acute" "acute" "acute" "acute" "acute" "acute" "acute" "acute" ...
ML80_onlyAcute6month$class
## [1] "acute" "acute" "acute" "acute" "acute" "acute" "acute"
## [8] "acute" "acute" "acute" "acute" "acute" "acute" "acute"
## [15] "acute" "acute" "acute" "acute" "acute" "acute" "acute"
## [22] "acute" "acute" "acute" "acute" "acute" "acute" "acute"
## [29] "6 month" "6 month" "6 month" "6 month" "6 month" "6 month" "6 month"
## [36] "6 month" "6 month" "6 month"
There are only 2 character entries, now change back to factors and see if valid.
ML80_onlyAcute6month$class <- as.factor(ML80_onlyAcute6month$class)
str(ML80_onlyAcute6month$class)
## Factor w/ 2 levels "6 month","acute": 2 2 2 2 2 2 2 2 2 2 ...
That seemed to work, so we move forward with tuning and modeling.
training <- ML80_onlyAcute6month[intrain,]
testing <- ML80_onlyAcute6month[-intrain,]
tunedModel1 <- tuneRF(training[,1:80], training$class, ntreetry= 10000,stepFactor= .5
, doBest=TRUE, improve=.5)
## mtry = 8 OOB error = 30%
## Searching left ...
## mtry = 16 OOB error = 20%
## 0.3333333 0.5
## Searching right ...
## mtry = 4 OOB error = 33.33%
## -0.1111111 0.5
The above says less than 24% error in out of bag samples with at least 8
features of mtry. So making the model with this number now.
set.seed(123)
ML80.rf.tuned1 <- randomForest(class ~ ., data=training, mtry=8, ntree=10000, keep.forest=TRUE,
importance=TRUE)
print(ML80.rf.tuned1)
##
## Call:
## randomForest(formula = class ~ ., data = training, mtry = 8, ntree = 10000, keep.forest = TRUE, importance = TRUE)
## Type of random forest: classification
## Number of trees: 10000
## No. of variables tried at each split: 8
##
## OOB estimate of error rate: 26.67%
## Confusion matrix:
## 6 month acute class.error
## 6 month 2 7 0.77777778
## acute 1 20 0.04761905
The model did very well on the acute cases and actually predicted a few of the 6 month cases. Lets see how it predicts on the testing set of unseen data.
predictions2 <- predict(ML80.rf.tuned1, newdata=testing)
predictions2
## 23 27 34 37 39 44 45 85
## acute acute acute acute acute acute acute acute
## Levels: 6 month acute
model2classes <- data.frame(predictions2,type=testing$class)
model2classes
## predictions2 type
## 23 acute acute
## 27 acute acute
## 34 acute acute
## 37 acute acute
## 39 acute acute
## 44 acute acute
## 45 acute acute
## 85 acute 6 month
All acute classes were predicted correctly but the only 6 month class was incorrectly predicted to be acute. That is a good introduction and data science project in understanding and reviewing how little nuances in factor attachment and classification with fewer classes as well as exploration of the randomForest package. Clearly the accuracy is much better but lets just see the accuracy measure for ourselves.
sum(model2classes$predictions2==model2classes$type)/length(model2classes$predictions2)
## [1] 0.875
Much better accuracy, so wonderful. Alright, well thanks so much for reading this exploration and data science hands on assignment.