Use of Machine Learning wrt Oxygen-Binding Proteins

This goal of this R markdown document is to explore the data derived from a larger set of proteins. The data set consists of 13562 proteins. The final goal of this work is to determine what impact machine-leanring on the classification methods such as Random-Forest and CART.

Load Libraries

library(corrplot) # produces correlation plots of variables

## corrplot 0.84 loaded

library(caret) #For Random Forest & CART & partioning of data

## Loading required package: lattice

## Loading required package: ggplot2

library(randomForest)

## randomForest 4.6-12

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

Load / Inspect Data

setwd("~/Dropbox/git_projects/random_forest/2_single_aa")
total_aa <- read.csv("~/Dropbox/git_projects/random_forest/2_single_aa/total_aa.csv")
# View(total_aa)

# View the top few rows of the data in R console
# head(total_aa,5)

Partition data into 80/20: trainset & testset

Dimensions of the testset

# Loading the Caret package (above) allows us to partition the data
# using 'createDataPartition'.
# We use the dataset to create a partition (80% training 20% testing)
index <- createDataPartition(total_aa$Class, p = 0.80, list = FALSE)
# select 80% of data to train the models
trainset <- total_aa[ index,]
# select 20% of the data for testing
testset <-  total_aa[-index,]

dim(testset)

## [1] 3390   22

Dimensions of the trainset

# Dimensions of the data
dim(trainset)

## [1] 13562    22

Structure into data Factors

trainset$Class <- factor(trainset$Class)
levels(trainset$Class) <- c("erythrocruorin", "hemerythrin", "hemocyanin",
                            "hemoglobin", "homo_sapiens", "leghemoglobin",
                            "myoglobin")

levels(trainset$Class)

## [1] "erythrocruorin" "hemerythrin"    "hemocyanin"     "hemoglobin"    
## [5] "homo_sapiens"   "leghemoglobin"  "myoglobin"

str(trainset)

## 'data.frame':    13562 obs. of  22 variables:
##  $ Class  : Factor w/ 7 levels "erythrocruorin",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ TotalAA: int  393 1210 2871 770 2351 2527 920 1002 2009 1863 ...
##  $ G      : num  0.0585 0.0702 0.1069 0.0494 0.0549 ...
##  $ P      : num  0.1145 0.062 0.061 0.0455 0.0544 ...
##  $ A      : num  0.0611 0.0595 0.0334 0.0818 0.0468 ...
##  $ V      : num  0.0458 0.0579 0.0376 0.0844 0.0532 ...
##  $ L      : num  0.0814 0.0917 0.0491 0.0727 0.0944 ...
##  $ I      : num  0.0204 0.057 0.0519 0.0312 0.0468 ...
##  $ M      : num  0.0305 0.0207 0.0181 0.0312 0.0259 ...
##  $ C      : num  0.0254 0.0496 0.1261 0.0234 0.0111 ...
##  $ F      : num  0.028 0.0298 0.0293 0.0273 0.0464 ...
##  $ Y      : num  0.0229 0.0298 0.0324 0.026 0.0336 ...
##  $ W      : num  0.01018 0.01074 0.00453 0.01169 0.01574 ...
##  $ H      : num  0.0305 0.0256 0.0167 0.0325 0.0319 ...
##  $ K      : num  0.0509 0.0545 0.0387 0.0532 0.0672 ...
##  $ R      : num  0.0662 0.0496 0.0456 0.0481 0.0442 ...
##  $ Q      : num  0.0382 0.0405 0.0348 0.0468 0.0468 ...
##  $ N      : num  0.0356 0.0545 0.0655 0.0403 0.0532 ...
##  $ E      : num  0.0763 0.0636 0.07 0.1195 0.063 ...
##  $ D      : num  0.0509 0.0504 0.0603 0.0649 0.0515 ...
##  $ S      : num  0.0967 0.0694 0.0603 0.0455 0.0932 ...
##  $ T      : num  0.056 0.0529 0.0578 0.0649 0.0659 ...

tail(trainset)

##           Class TotalAA          G          P          A          V
## 16947 myoglobin     103 0.09708738 0.01941748 0.10679610 0.08737864
## 16948 myoglobin     156 0.03846154 0.03846154 0.06410256 0.05769231
## 16949 myoglobin     147 0.08843537 0.03401361 0.09523810 0.09523810
## 16950 myoglobin     143 0.06993007 0.04895105 0.10489510 0.04895105
## 16951 myoglobin     146 0.09589041 0.03424658 0.09589041 0.12328770
## 16952 myoglobin     141 0.10638300 0.04255319 0.06382979 0.09929078
##               L           I           M           C          F          Y
## 16947 0.1165049 0.019417480 0.019417480 0.009708738 0.03883495 0.01941748
## 16948 0.1217949 0.032051280 0.025641030 0.012820510 0.05128205 0.02564103
## 16949 0.1156463 0.040816330 0.020408160 0.013605440 0.04761905 0.02040816
## 16950 0.1608392 0.006993007 0.013986010 0.013986010 0.05594406 0.02097902
## 16951 0.1301370 0.000000000 0.006849315 0.006849315 0.05479452 0.02054795
## 16952 0.1063830 0.014184400 0.021276600 0.021276600 0.04964539 0.02127660
##                 W          H          K          R           Q          N
## 16947 0.019417480 0.04854369 0.07766990 0.01941748 0.009708738 0.04854369
## 16948 0.019230770 0.05128205 0.06410256 0.07692308 0.032051280 0.01923077
## 16949 0.013605440 0.06122449 0.08163265 0.01360544 0.027210880 0.04761905
## 16950 0.006993007 0.06293706 0.04195804 0.03496503 0.041958040 0.02097902
## 16951 0.013698630 0.06164384 0.07534247 0.02739726 0.027397260 0.04794521
## 16952 0.014184400 0.03546099 0.09929078 0.02127660 0.028368790 0.04255319
##                E          D          S          T
## 16947 0.04854369 0.07766990 0.07766990 0.03883495
## 16948 0.05769231 0.05128205 0.10897440 0.05128205
## 16949 0.04081633 0.04761905 0.06122449 0.03401361
## 16950 0.02097902 0.06993007 0.09090909 0.06293706
## 16951 0.06849315 0.04794521 0.04109589 0.02054795
## 16952 0.06382979 0.04964539 0.05673759 0.04255319

Five number summary of data variables

summary(trainset)

##             Class          TotalAA             G          
##  erythrocruorin:13057   Min.   :   4.0   Min.   :0.00000  
##  hemerythrin   :   10   1st Qu.: 254.0   1st Qu.:0.05047  
##  hemocyanin    :   10   Median : 431.0   Median :0.06464  
##  hemoglobin    :   58   Mean   : 589.8   Mean   :0.06751  
##  homo_sapiens  :   10   3rd Qu.: 706.0   3rd Qu.:0.08025  
##  leghemoglobin :  101   Max.   :8797.0   Max.   :0.46474  
##  myoglobin     :  316                                     
##        P                 A                 V                 L          
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.04150   1st Qu.:0.05506   1st Qu.:0.04843   1st Qu.:0.08060  
##  Median :0.05481   Median :0.06960   Median :0.06040   Median :0.09804  
##  Mean   :0.06106   Mean   :0.07342   Mean   :0.06148   Mean   :0.09910  
##  3rd Qu.:0.07383   3rd Qu.:0.08772   3rd Qu.:0.07273   3rd Qu.:0.11683  
##  Max.   :0.38060   Max.   :0.40000   Max.   :0.33333   Max.   :0.32323  
##                                                                         
##        I                 M                 C                 F          
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.02883   1st Qu.:0.01521   1st Qu.:0.01181   1st Qu.:0.02637  
##  Median :0.04245   Median :0.02105   Median :0.01872   Median :0.03610  
##  Mean   :0.04283   Mean   :0.02232   Mean   :0.02315   Mean   :0.03751  
##  3rd Qu.:0.05556   3rd Qu.:0.02765   3rd Qu.:0.02765   3rd Qu.:0.04735  
##  Max.   :0.21538   Max.   :0.13836   Max.   :0.36816   Max.   :0.13636  
##                                                                         
##        Y                 W                  H                 K          
##  Min.   :0.00000   Min.   :0.000000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.01852   1st Qu.:0.006555   1st Qu.:0.01766   1st Qu.:0.03943  
##  Median :0.02653   Median :0.011450   Median :0.02431   Median :0.05628  
##  Mean   :0.02765   Mean   :0.012778   Mean   :0.02635   Mean   :0.05884  
##  3rd Qu.:0.03540   3rd Qu.:0.017237   3rd Qu.:0.03185   3rd Qu.:0.07438  
##  Max.   :0.23611   Max.   :0.232877   Max.   :0.30000   Max.   :0.28866  
##                                                                          
##        R                 Q                 N                 E          
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.04286   1st Qu.:0.03350   1st Qu.:0.02457   1st Qu.:0.05157  
##  Median :0.05500   Median :0.04368   Median :0.03488   Median :0.06652  
##  Mean   :0.05762   Mean   :0.04562   Mean   :0.03547   Mean   :0.06874  
##  3rd Qu.:0.06910   3rd Qu.:0.05462   3rd Qu.:0.04513   3rd Qu.:0.08281  
##  Max.   :0.47059   Max.   :0.98750   Max.   :0.14286   Max.   :0.27072  
##                                                                         
##        D                 S                 T          
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.03770   1st Qu.:0.06107   1st Qu.:0.04149  
##  Median :0.04791   Median :0.07531   Median :0.05098  
##  Mean   :0.04814   Mean   :0.07800   Mean   :0.05227  
##  3rd Qu.:0.05762   3rd Qu.:0.09185   3rd Qu.:0.06092  
##  Max.   :0.25000   Max.   :0.41660   Max.   :0.34949  
##

Box plots to understand how the distribution varies by % Amino Acid

par(mfrow=c(1,5))
  for(i in 2:22) {
  boxplot(trainset[,i], main = names(trainset)[i])
}

Compute the correlation matrix for the 21 variables

# Reduce total_aa data set to amino acids only
corrSet <- total_aa[,2:22]

#str(corrSet)

corrMatrix <- cor(corrSet) # Calculate correlation coefficients
#corrMatrix

# Determine Max and Min
#which(abs(corrMatrix) > 0.4, arr.ind = T) 
#Check for corrlation coefficients |corrMatrix(i,j)| > 0.4

MinCorrelation <- min(corrMatrix)
MinCorrelation

## [1] -0.3673635

MaxCorrelation <- corrMatrix[ corrMatrix > 0.35 & corrMatrix < 1 ]
MaxCorrelation

## [1] 0.3607724 0.3607724

Which samples give min/max correlations

which(corrMatrix < -0.36)

## [1]  49 129

which(corrMatrix > 0.35 & corrMatrix < 1)

## [1] 143 343

Generate Correlation plot of Amino Acids

corrplot(corrMatrix, method = "color", type = "upper")

Configure parallel processing

Parallel processing in caret can be accomplished with the parallel and doParallel packages. The following code loads the required libraries (note, these libraries also depend on the iterators and foreach libraries).

library(parallel)
library(doParallel)

## Loading required package: foreach

## Loading required package: iterators

cluster <- makeCluster(detectCores() - 1) 
# leave 1 core for OS
registerDoParallel(cluster)

Develop training model

Use randomForest::train() to train the model * classification (sqrt(p) where p is number of variables in x)

set.seed(1234)
# Time
start_RF <- Sys.time()

# Fit a model
# model <- train(trainset$Class ~ .,
#                data = trainset,
#                method = "rf",
#                trControl = trainControl(method = "cv", number = 5))
modFit.rf <- randomForest::randomForest(trainset$Class ~ ., 
                                        data = trainset[,c(3:22)],
                                        ntree = 50,
                                        replace = TRUE,
                                        nodesize = 5, #1(default)
                                        importance = TRUE,
                                        keep.forest = TRUE, #FALSE(default)
                                        norm.votes = TRUE, #TRUE(default)
                                        do.trace = TRUE, #FALSE(default)
                                        proximity = TRUE #FALSE(default)
                                        )

## ntree      OOB      1      2      3      4      5      6      7
##     1:   1.70%  0.81% 50.00%100.00% 62.96%100.00% 18.42%  9.17%
##     2:   1.57%  0.70% 85.71%100.00% 54.29% 85.71% 22.95% 11.48%
##     3:   1.54%  0.72% 88.89% 88.89% 54.55% 50.00% 22.97% 10.87%
##     4:   1.43%  0.68% 90.00% 70.00% 49.02% 62.50% 19.54% 10.20%
##     5:   1.33%  0.54% 90.00% 80.00% 47.27% 55.56% 21.28% 11.19%
##     6:   1.33%  0.57%100.00% 80.00% 40.35% 55.56% 25.77%  9.69%
##     7:   1.28%  0.50%100.00% 80.00% 46.55% 44.44% 22.45% 10.85%
##     8:   1.12%  0.38%100.00% 80.00% 43.10% 60.00% 20.41% 10.16%
##     9:   1.11%  0.37%100.00% 80.00% 39.66% 30.00% 24.00% 10.36%
##    10:   1.01%  0.29%100.00% 80.00% 39.66% 50.00% 22.00%  9.65%
##    11:   0.97%  0.29%100.00% 80.00% 43.10% 40.00% 21.00%  7.99%
##    12:   0.90%  0.22%100.00% 80.00% 39.66% 40.00% 21.00%  8.25%
##    13:   0.85%  0.19%100.00% 80.00% 39.66% 30.00% 19.80%  8.23%
##    14:   0.83%  0.17%100.00% 80.00% 39.66% 50.00% 18.81%  8.23%
##    15:   0.80%  0.17%100.00% 70.00% 34.48% 50.00% 19.80%  7.91%
##    16:   0.83%  0.15%100.00% 80.00% 39.66% 50.00% 20.79%  8.23%
##    17:   0.83%  0.15%100.00% 70.00% 41.38% 40.00% 22.77%  8.23%
##    18:   0.78%  0.11%100.00% 80.00% 41.38% 40.00% 21.78%  7.28%
##    19:   0.78%  0.12%100.00% 70.00% 41.38% 40.00% 20.79%  7.59%
##    20:   0.76%  0.11%100.00% 70.00% 39.66% 50.00% 21.78%  6.65%
##    21:   0.76%  0.11%100.00% 90.00% 39.66% 40.00% 20.79%  6.65%
##    22:   0.72%  0.11%100.00% 80.00% 32.76% 40.00% 20.79%  6.65%
##    23:   0.74%  0.09%100.00% 80.00% 39.66% 40.00% 20.79%  7.28%
##    24:   0.74%  0.09%100.00% 80.00% 37.93% 40.00% 22.77%  6.96%
##    25:   0.72%  0.08%100.00% 70.00% 39.66% 50.00% 19.80%  6.96%
##    26:   0.68%  0.06%100.00% 70.00% 37.93% 40.00% 19.80%  6.65%
##    27:   0.72%  0.06%100.00% 80.00% 41.38% 40.00% 19.80%  7.28%
##    28:   0.72%  0.05%100.00% 80.00% 41.38% 50.00% 19.80%  7.28%
##    29:   0.74%  0.07%100.00% 90.00% 41.38% 40.00% 20.79%  7.28%
##    30:   0.74%  0.05%100.00% 90.00% 41.38% 50.00% 21.78%  7.28%
##    31:   0.76%  0.07%100.00% 90.00% 41.38% 50.00% 21.78%  7.59%
##    32:   0.73%  0.06%100.00% 80.00% 43.10% 40.00% 20.79%  7.28%
##    33:   0.72%  0.05%100.00% 80.00% 41.38% 40.00% 20.79%  7.59%
##    34:   0.72%  0.06%100.00% 80.00% 41.38% 40.00% 20.79%  6.96%
##    35:   0.72%  0.05%100.00% 80.00% 41.38% 40.00% 21.78%  6.96%
##    36:   0.69%  0.05%100.00% 80.00% 37.93% 40.00% 20.79%  6.96%
##    37:   0.69%  0.06%100.00% 80.00% 37.93% 40.00% 19.80%  6.96%
##    38:   0.69%  0.07%100.00% 70.00% 34.48% 40.00% 19.80%  7.28%
##    39:   0.69%  0.06%100.00% 70.00% 34.48% 40.00% 21.78%  7.28%
##    40:   0.69%  0.05%100.00% 70.00% 34.48% 40.00% 22.77%  6.96%
##    41:   0.67%  0.05%100.00% 80.00% 34.48% 40.00% 20.79%  6.96%
##    42:   0.69%  0.05%100.00% 80.00% 36.21% 40.00% 21.78%  7.28%
##    43:   0.68%  0.04%100.00% 80.00% 36.21% 40.00% 21.78%  6.96%
##    44:   0.68%  0.03%100.00% 80.00% 37.93% 40.00% 20.79%  7.28%
##    45:   0.69%  0.03%100.00% 80.00% 37.93% 50.00% 20.79%  7.28%
##    46:   0.69%  0.03%100.00% 70.00% 39.66% 50.00% 20.79%  7.28%
##    47:   0.69%  0.05%100.00% 80.00% 36.21% 50.00% 20.79%  7.28%
##    48:   0.70%  0.04%100.00% 80.00% 37.93% 50.00% 21.78%  7.28%
##    49:   0.70%  0.04%100.00% 80.00% 37.93% 50.00% 21.78%  7.28%
##    50:   0.69%  0.04%100.00% 80.00% 37.93% 40.00% 21.78%  7.28%

# Time
end_RF <- Sys.time()
RF_time = end_RF - start_RF
RF_time

## Time difference of 23.95638 secs

De-register parallel processing cluster

After processing the data, we explicitly shut down the cluster by calling the stopCluster() function.

stopCluster(cluster)

Investigate Fit:

At this point we have a trained model in the fit object, and can take a number of steps to evaluate the suitability of this model, including accuracy and a confusion matrix that is based on comparing the modeled data to the held out folds.

# Let's check the model
modFit.rf

## 
## Call:
##  randomForest(formula = trainset$Class ~ ., data = trainset[,      c(3:22)], ntree = 50, replace = TRUE, nodesize = 5, importance = TRUE,      keep.forest = TRUE, norm.votes = TRUE, do.trace = TRUE, proximity = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 50
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 0.69%
## Confusion matrix:
##                erythrocruorin hemerythrin hemocyanin hemoglobin
## erythrocruorin          13052           0          0          4
## hemerythrin                10           0          0          0
## hemocyanin                  8           0          2          0
## hemoglobin                 20           0          0         36
## homo_sapiens                4           0          0          0
## leghemoglobin              22           0          0          0
## myoglobin                  23           0          0          0
##                homo_sapiens leghemoglobin myoglobin  class.error
## erythrocruorin            0             1         0 0.0003829364
## hemerythrin               0             0         0 1.0000000000
## hemocyanin                0             0         0 0.8000000000
## hemoglobin                0             0         2 0.3793103448
## homo_sapiens              6             0         0 0.4000000000
## leghemoglobin             0            79         0 0.2178217822
## myoglobin                 0             0       293 0.0727848101

Number of predictors sampled for spliting at each node.

modFit.rf$mtry

## [1] 4

err.rate

(classification only) vector error rates of the prediction on the input data, the i-th element being the (OOB) error rate for all trees up to the i-th.

#modFit.rf$err.rate

proximity

Should proximity measure among the rows be calculated?

#modFit.rf$proximity

forest
(a list that contains the entire forest; NULL if randomForest is run in unsupervised mode or if keep.forest=FALSE.

#modFit.rf$forest

Machine Set up

#sessionInfo()