The goal of this R Markdown document is to explore data derived from a larger set of proteins; the data set consists of 13562 proteins. The final goal of this work is to evaluate how well machine-learning classification methods such as Random Forest and CART perform on these data.
library(corrplot)      # produces correlation plots of variables
## corrplot 0.84 loaded
library(caret)         # for Random Forest, CART, and partitioning of the data
## Loading required package: lattice
## Loading required package: ggplot2
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
##     margin
setwd("~/Dropbox/git_projects/random_forest/2_single_aa")
total_aa <- read.csv("~/Dropbox/git_projects/random_forest/2_single_aa/total_aa.csv")
# View(total_aa)
# View the top few rows of the data in the R console
# head(total_aa, 5)

# Loading the caret package (above) allows us to partition the data
# using 'createDataPartition'.
# We use the data set to create a partition (80% training, 20% testing).
index <- createDataPartition(total_aa$Class, p = 0.80, list = FALSE)
# select 80% of the data to train the models
trainset <- total_aa[ index, ]
# select 20% of the data for testing
testset <- total_aa[-index, ]
dim(testset)
## [1] 3390   22
# Dimensions of the data
dim(trainset)
## [1] 13562    22
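Note that createDataPartition draws a random (stratified) split, so the exact membership of trainset and testset changes from run to run. A minimal sketch of making the split reproducible; the seed value is an arbitrary choice and not part of the original analysis:

# Setting a seed immediately before the partition makes the 80/20 split reproducible
set.seed(42)   # arbitrary illustrative seed
index <- createDataPartition(total_aa$Class, p = 0.80, list = FALSE)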
trainset$Class <- factor(trainset$Class)
levels(trainset$Class) <- c("erythrocruorin", "hemerythrin", "hemocyanin",
"hemoglobin", "homo_sapiens", "leghemoglobin",
"myoglobin")
levels(trainset$Class)
## [1] "erythrocruorin" "hemerythrin"    "hemocyanin"     "hemoglobin"
## [5] "homo_sapiens"   "leghemoglobin"  "myoglobin"
str(trainset)
## 'data.frame':    13562 obs. of  22 variables:
## $ Class : Factor w/ 7 levels "erythrocruorin",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ TotalAA: int 393 1210 2871 770 2351 2527 920 1002 2009 1863 ...
## $ G : num 0.0585 0.0702 0.1069 0.0494 0.0549 ...
## $ P : num 0.1145 0.062 0.061 0.0455 0.0544 ...
## $ A : num 0.0611 0.0595 0.0334 0.0818 0.0468 ...
## $ V : num 0.0458 0.0579 0.0376 0.0844 0.0532 ...
## $ L : num 0.0814 0.0917 0.0491 0.0727 0.0944 ...
## $ I : num 0.0204 0.057 0.0519 0.0312 0.0468 ...
## $ M : num 0.0305 0.0207 0.0181 0.0312 0.0259 ...
## $ C : num 0.0254 0.0496 0.1261 0.0234 0.0111 ...
## $ F : num 0.028 0.0298 0.0293 0.0273 0.0464 ...
## $ Y : num 0.0229 0.0298 0.0324 0.026 0.0336 ...
## $ W : num 0.01018 0.01074 0.00453 0.01169 0.01574 ...
## $ H : num 0.0305 0.0256 0.0167 0.0325 0.0319 ...
## $ K : num 0.0509 0.0545 0.0387 0.0532 0.0672 ...
## $ R : num 0.0662 0.0496 0.0456 0.0481 0.0442 ...
## $ Q : num 0.0382 0.0405 0.0348 0.0468 0.0468 ...
## $ N : num 0.0356 0.0545 0.0655 0.0403 0.0532 ...
## $ E : num 0.0763 0.0636 0.07 0.1195 0.063 ...
## $ D : num 0.0509 0.0504 0.0603 0.0649 0.0515 ...
## $ S : num 0.0967 0.0694 0.0603 0.0455 0.0932 ...
## $ T : num 0.056 0.0529 0.0578 0.0649 0.0659 ...
tail(trainset)
##               Class TotalAA          G          P          A          V
## 16947 myoglobin 103 0.09708738 0.01941748 0.10679610 0.08737864
## 16948 myoglobin 156 0.03846154 0.03846154 0.06410256 0.05769231
## 16949 myoglobin 147 0.08843537 0.03401361 0.09523810 0.09523810
## 16950 myoglobin 143 0.06993007 0.04895105 0.10489510 0.04895105
## 16951 myoglobin 146 0.09589041 0.03424658 0.09589041 0.12328770
## 16952 myoglobin 141 0.10638300 0.04255319 0.06382979 0.09929078
## L I M C F Y
## 16947 0.1165049 0.019417480 0.019417480 0.009708738 0.03883495 0.01941748
## 16948 0.1217949 0.032051280 0.025641030 0.012820510 0.05128205 0.02564103
## 16949 0.1156463 0.040816330 0.020408160 0.013605440 0.04761905 0.02040816
## 16950 0.1608392 0.006993007 0.013986010 0.013986010 0.05594406 0.02097902
## 16951 0.1301370 0.000000000 0.006849315 0.006849315 0.05479452 0.02054795
## 16952 0.1063830 0.014184400 0.021276600 0.021276600 0.04964539 0.02127660
## W H K R Q N
## 16947 0.019417480 0.04854369 0.07766990 0.01941748 0.009708738 0.04854369
## 16948 0.019230770 0.05128205 0.06410256 0.07692308 0.032051280 0.01923077
## 16949 0.013605440 0.06122449 0.08163265 0.01360544 0.027210880 0.04761905
## 16950 0.006993007 0.06293706 0.04195804 0.03496503 0.041958040 0.02097902
## 16951 0.013698630 0.06164384 0.07534247 0.02739726 0.027397260 0.04794521
## 16952 0.014184400 0.03546099 0.09929078 0.02127660 0.028368790 0.04255319
## E D S T
## 16947 0.04854369 0.07766990 0.07766990 0.03883495
## 16948 0.05769231 0.05128205 0.10897440 0.05128205
## 16949 0.04081633 0.04761905 0.06122449 0.03401361
## 16950 0.02097902 0.06993007 0.09090909 0.06293706
## 16951 0.06849315 0.04794521 0.04109589 0.02054795
## 16952 0.06382979 0.04964539 0.05673759 0.04255319
summary(trainset)
##             Class           TotalAA             G
## erythrocruorin:13057 Min. : 4.0 Min. :0.00000
## hemerythrin : 10 1st Qu.: 254.0 1st Qu.:0.05047
## hemocyanin : 10 Median : 431.0 Median :0.06464
## hemoglobin : 58 Mean : 589.8 Mean :0.06751
## homo_sapiens : 10 3rd Qu.: 706.0 3rd Qu.:0.08025
## leghemoglobin : 101 Max. :8797.0 Max. :0.46474
## myoglobin : 316
## P A V L
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.04150 1st Qu.:0.05506 1st Qu.:0.04843 1st Qu.:0.08060
## Median :0.05481 Median :0.06960 Median :0.06040 Median :0.09804
## Mean :0.06106 Mean :0.07342 Mean :0.06148 Mean :0.09910
## 3rd Qu.:0.07383 3rd Qu.:0.08772 3rd Qu.:0.07273 3rd Qu.:0.11683
## Max. :0.38060 Max. :0.40000 Max. :0.33333 Max. :0.32323
##
## I M C F
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.02883 1st Qu.:0.01521 1st Qu.:0.01181 1st Qu.:0.02637
## Median :0.04245 Median :0.02105 Median :0.01872 Median :0.03610
## Mean :0.04283 Mean :0.02232 Mean :0.02315 Mean :0.03751
## 3rd Qu.:0.05556 3rd Qu.:0.02765 3rd Qu.:0.02765 3rd Qu.:0.04735
## Max. :0.21538 Max. :0.13836 Max. :0.36816 Max. :0.13636
##
## Y W H K
## Min. :0.00000 Min. :0.000000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.01852 1st Qu.:0.006555 1st Qu.:0.01766 1st Qu.:0.03943
## Median :0.02653 Median :0.011450 Median :0.02431 Median :0.05628
## Mean :0.02765 Mean :0.012778 Mean :0.02635 Mean :0.05884
## 3rd Qu.:0.03540 3rd Qu.:0.017237 3rd Qu.:0.03185 3rd Qu.:0.07438
## Max. :0.23611 Max. :0.232877 Max. :0.30000 Max. :0.28866
##
## R Q N E
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.04286 1st Qu.:0.03350 1st Qu.:0.02457 1st Qu.:0.05157
## Median :0.05500 Median :0.04368 Median :0.03488 Median :0.06652
## Mean :0.05762 Mean :0.04562 Mean :0.03547 Mean :0.06874
## 3rd Qu.:0.06910 3rd Qu.:0.05462 3rd Qu.:0.04513 3rd Qu.:0.08281
## Max. :0.47059 Max. :0.98750 Max. :0.14286 Max. :0.27072
##
## D S T
## Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.03770 1st Qu.:0.06107 1st Qu.:0.04149
## Median :0.04791 Median :0.07531 Median :0.05098
## Mean :0.04814 Mean :0.07800 Mean :0.05227
## 3rd Qu.:0.05762 3rd Qu.:0.09185 3rd Qu.:0.06092
## Max. :0.25000 Max. :0.41660 Max. :0.34949
##
par(mfrow = c(1, 5))
for (i in 2:22) {
  boxplot(trainset[, i], main = names(trainset)[i])
}

# Reduce the total_aa data set to the amino acid columns only
corrSet <- total_aa[,2:22]
#str(corrSet)
corrMatrix <- cor(corrSet) # Calculate correlation coefficients
#corrMatrix
# Determine Max and Min
#which(abs(corrMatrix) > 0.4, arr.ind = T)
# Check for correlation coefficients with |corrMatrix[i, j]| > 0.4
MinCorrelation <- min(corrMatrix)
MinCorrelation
## [1] -0.3673635
MaxCorrelation <- corrMatrix[ corrMatrix > 0.35 & corrMatrix < 1 ]
MaxCorrelation
## [1] 0.3607724 0.3607724
which(corrMatrix < -0.36)
## [1]  49 129
which(corrMatrix > 0.35 & corrMatrix < 1)
## [1] 143 343
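The flat indices returned by which() above refer to positions in the 21 x 21 correlation matrix and are hard to interpret directly. A small sketch of the same checks with arr.ind = TRUE, which reports the row and column (i.e. the variable pair) instead:

# Report the correlated variable pairs by row/column rather than by flat index
which(corrMatrix > 0.35 & corrMatrix < 1, arr.ind = TRUE)
which(corrMatrix < -0.36, arr.ind = TRUE)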
corrplot(corrMatrix, method = "color", type = "upper")

Parallel processing in caret can be accomplished with the parallel and doParallel packages. The following code loads the required libraries (note that these packages also depend on the iterators and foreach packages).
library(parallel)
library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
cluster <- makeCluster(detectCores() - 1)
# leave 1 core for OS
registerDoParallel(cluster)

Use randomForest::randomForest() to train the model. For classification, the default mtry is sqrt(p), where p is the number of predictor variables in x.
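As a quick check (a sketch added here, not part of the original script), the default mtry for the 20 amino-acid frequency predictors used below works out to 4, matching the value reported by modFit.rf$mtry later on:

# For classification, the default mtry is floor(sqrt(p)); here p = 20 (columns 3:22)
floor(sqrt(ncol(trainset[, c(3:22)])))  # floor(sqrt(20)) = 4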
set.seed(1234)
# Time
start_RF <- Sys.time()
# Fit a model
# model <- train(trainset$Class ~ .,
# data = trainset,
# method = "rf",
# trControl = trainControl(method = "cv", number = 5))
modFit.rf <- randomForest::randomForest(trainset$Class ~ .,
data = trainset[,c(3:22)],
ntree = 50,
replace = TRUE,
nodesize = 5, #1(default)
importance = TRUE,
keep.forest = TRUE, #FALSE(default)
norm.votes = TRUE, #TRUE(default)
do.trace = TRUE, #FALSE(default)
proximity = TRUE #FALSE(default)
)
## ntree      OOB      1      2      3      4      5      6      7
## 1: 1.70% 0.81% 50.00%100.00% 62.96%100.00% 18.42% 9.17%
## 2: 1.57% 0.70% 85.71%100.00% 54.29% 85.71% 22.95% 11.48%
## 3: 1.54% 0.72% 88.89% 88.89% 54.55% 50.00% 22.97% 10.87%
## 4: 1.43% 0.68% 90.00% 70.00% 49.02% 62.50% 19.54% 10.20%
## 5: 1.33% 0.54% 90.00% 80.00% 47.27% 55.56% 21.28% 11.19%
## 6: 1.33% 0.57%100.00% 80.00% 40.35% 55.56% 25.77% 9.69%
## 7: 1.28% 0.50%100.00% 80.00% 46.55% 44.44% 22.45% 10.85%
## 8: 1.12% 0.38%100.00% 80.00% 43.10% 60.00% 20.41% 10.16%
## 9: 1.11% 0.37%100.00% 80.00% 39.66% 30.00% 24.00% 10.36%
## 10: 1.01% 0.29%100.00% 80.00% 39.66% 50.00% 22.00% 9.65%
## 11: 0.97% 0.29%100.00% 80.00% 43.10% 40.00% 21.00% 7.99%
## 12: 0.90% 0.22%100.00% 80.00% 39.66% 40.00% 21.00% 8.25%
## 13: 0.85% 0.19%100.00% 80.00% 39.66% 30.00% 19.80% 8.23%
## 14: 0.83% 0.17%100.00% 80.00% 39.66% 50.00% 18.81% 8.23%
## 15: 0.80% 0.17%100.00% 70.00% 34.48% 50.00% 19.80% 7.91%
## 16: 0.83% 0.15%100.00% 80.00% 39.66% 50.00% 20.79% 8.23%
## 17: 0.83% 0.15%100.00% 70.00% 41.38% 40.00% 22.77% 8.23%
## 18: 0.78% 0.11%100.00% 80.00% 41.38% 40.00% 21.78% 7.28%
## 19: 0.78% 0.12%100.00% 70.00% 41.38% 40.00% 20.79% 7.59%
## 20: 0.76% 0.11%100.00% 70.00% 39.66% 50.00% 21.78% 6.65%
## 21: 0.76% 0.11%100.00% 90.00% 39.66% 40.00% 20.79% 6.65%
## 22: 0.72% 0.11%100.00% 80.00% 32.76% 40.00% 20.79% 6.65%
## 23: 0.74% 0.09%100.00% 80.00% 39.66% 40.00% 20.79% 7.28%
## 24: 0.74% 0.09%100.00% 80.00% 37.93% 40.00% 22.77% 6.96%
## 25: 0.72% 0.08%100.00% 70.00% 39.66% 50.00% 19.80% 6.96%
## 26: 0.68% 0.06%100.00% 70.00% 37.93% 40.00% 19.80% 6.65%
## 27: 0.72% 0.06%100.00% 80.00% 41.38% 40.00% 19.80% 7.28%
## 28: 0.72% 0.05%100.00% 80.00% 41.38% 50.00% 19.80% 7.28%
## 29: 0.74% 0.07%100.00% 90.00% 41.38% 40.00% 20.79% 7.28%
## 30: 0.74% 0.05%100.00% 90.00% 41.38% 50.00% 21.78% 7.28%
## 31: 0.76% 0.07%100.00% 90.00% 41.38% 50.00% 21.78% 7.59%
## 32: 0.73% 0.06%100.00% 80.00% 43.10% 40.00% 20.79% 7.28%
## 33: 0.72% 0.05%100.00% 80.00% 41.38% 40.00% 20.79% 7.59%
## 34: 0.72% 0.06%100.00% 80.00% 41.38% 40.00% 20.79% 6.96%
## 35: 0.72% 0.05%100.00% 80.00% 41.38% 40.00% 21.78% 6.96%
## 36: 0.69% 0.05%100.00% 80.00% 37.93% 40.00% 20.79% 6.96%
## 37: 0.69% 0.06%100.00% 80.00% 37.93% 40.00% 19.80% 6.96%
## 38: 0.69% 0.07%100.00% 70.00% 34.48% 40.00% 19.80% 7.28%
## 39: 0.69% 0.06%100.00% 70.00% 34.48% 40.00% 21.78% 7.28%
## 40: 0.69% 0.05%100.00% 70.00% 34.48% 40.00% 22.77% 6.96%
## 41: 0.67% 0.05%100.00% 80.00% 34.48% 40.00% 20.79% 6.96%
## 42: 0.69% 0.05%100.00% 80.00% 36.21% 40.00% 21.78% 7.28%
## 43: 0.68% 0.04%100.00% 80.00% 36.21% 40.00% 21.78% 6.96%
## 44: 0.68% 0.03%100.00% 80.00% 37.93% 40.00% 20.79% 7.28%
## 45: 0.69% 0.03%100.00% 80.00% 37.93% 50.00% 20.79% 7.28%
## 46: 0.69% 0.03%100.00% 70.00% 39.66% 50.00% 20.79% 7.28%
## 47: 0.69% 0.05%100.00% 80.00% 36.21% 50.00% 20.79% 7.28%
## 48: 0.70% 0.04%100.00% 80.00% 37.93% 50.00% 21.78% 7.28%
## 49: 0.70% 0.04%100.00% 80.00% 37.93% 50.00% 21.78% 7.28%
## 50: 0.69% 0.04%100.00% 80.00% 37.93% 40.00% 21.78% 7.28%
# Time
end_RF <- Sys.time()
RF_time = end_RF - start_RF
RF_time
## Time difference of 23.95638 secs
After processing the data, we explicitly shut down the cluster by calling the stopCluster() function.
stopCluster(cluster)

At this point we have a trained model in the modFit.rf object, and we can take a number of steps to evaluate the suitability of this model, including accuracy and a confusion matrix based on the out-of-bag (OOB) samples.
# Let's check the model
modFit.rf
##
## Call:
## randomForest(formula = trainset$Class ~ ., data = trainset[, c(3:22)], ntree = 50, replace = TRUE, nodesize = 5, importance = TRUE, keep.forest = TRUE, norm.votes = TRUE, do.trace = TRUE, proximity = TRUE)
## Type of random forest: classification
## Number of trees: 50
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 0.69%
## Confusion matrix:
## erythrocruorin hemerythrin hemocyanin hemoglobin
## erythrocruorin 13052 0 0 4
## hemerythrin 10 0 0 0
## hemocyanin 8 0 2 0
## hemoglobin 20 0 0 36
## homo_sapiens 4 0 0 0
## leghemoglobin 22 0 0 0
## myoglobin 23 0 0 0
## homo_sapiens leghemoglobin myoglobin class.error
## erythrocruorin 0 1 0 0.0003829364
## hemerythrin 0 0 0 1.0000000000
## hemocyanin 0 0 0 0.8000000000
## hemoglobin 0 0 2 0.3793103448
## homo_sapiens 6 0 0 0.4000000000
## leghemoglobin 0 79 0 0.2178217822
## myoglobin 0 0 293 0.0727848101
modFit.rf$mtry
## [1] 4
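The 20% test partition created earlier has not been used yet. A minimal sketch of scoring the model on that held-out set follows; it assumes the raw Class column in testset contains the same seven categories, in the same order, as the training set, so that the factor labels line up:

# Recode the test-set labels with the same factor levels used for the training set
testset$Class <- factor(testset$Class)
levels(testset$Class) <- levels(trainset$Class)

# Predict the held-out proteins from their amino-acid frequencies (columns 3:22)
pred.rf <- predict(modFit.rf, newdata = testset[, c(3:22)])

# Accuracy and a test-set confusion matrix via caret
confusionMatrix(pred.rf, testset$Class)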
err.rate: (classification only) vector of error rates of the prediction on the input data, the i-th element being the (OOB) error rate for all trees up to the i-th.

#modFit.rf$err.rate

proximity: should the proximity measure among the rows be calculated?

#modFit.rf$proximity

forest: a list that contains the entire forest; NULL if randomForest is run in unsupervised mode or if keep.forest = FALSE.

#modFit.rf$forest

#sessionInfo()
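The per-tree error rates stored in err.rate are easiest to judge as a plot, which shows whether 50 trees were enough for the OOB error to level off; and because importance = TRUE was set in the call above, the stored variable importance can be plotted as well. A minimal sketch, not part of the original script:

# Plot the OOB error and per-class error rates against the number of trees
plot(modFit.rf, main = "modFit.rf: error rates vs. number of trees")

# Since importance = TRUE was set, visualize the stored variable importance
varImpPlot(modFit.rf)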