File Name: EDA-Part1-Oxygen-binding.rmd

Summary:

One important question which I would like to answer with an exploratory data analysis is this data set sufficient to succesfully classify the seven classes of protein using a Random Forest machine learning approach? This should clearer as we test and further describe the data.

Code Key Protein Class
0 / Ctrl Controls consist of human proteins which do not bind oxygen
1 / Ery Erythrocruorin
2 / Hcy Hemocyanin
3 / Hgb Hemoglobin
4 / Hhe Hemerythrin
5 / Lgb Leghemoglobin
6 / Mgb Myoglobin

Libraries:

knitr::opts_chunk$set(echo = TRUE)

Libraries = c("readr", "psych", "knitr")

# Install if not present
for(p in Libraries){
    if(!require(p, character.only = TRUE))
        install.packages(p)
    library(p, character.only = TRUE)
}

A. Load Data

Import Single Amino Acid percent composition for first round of Exploratory Data Analysis.

complete_aa <- read_csv("complete_aa.csv")

B. Check Data Head

head(complete_aa)
## # A tibble: 6 x 22
##   Class TotalAA      G      P      A      V      L      I      M      C
##   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1     0     393 0.0585 0.115  0.0611 0.0458 0.0814 0.0204 0.0305 0.0254
## 2     0    2016 0.0580 0.0526 0.0675 0.0590 0.113  0.0605 0.0342 0.0208
## 3     0    1210 0.0702 0.0620 0.0595 0.0579 0.0917 0.0570 0.0207 0.0496
## 4     0    2871 0.107  0.0610 0.0334 0.0376 0.0491 0.0519 0.0181 0.126 
## 5     0     770 0.0494 0.0455 0.0818 0.0844 0.0727 0.0312 0.0312 0.0234
## 6     0     781 0.0717 0.0448 0.101  0.0691 0.120  0.0423 0.0384 0.0141
## # ... with 12 more variables: F <dbl>, Y <dbl>, W <dbl>, H <dbl>, K <dbl>,
## #   R <dbl>, Q <dbl>, N <dbl>, E <dbl>, D <dbl>, S <dbl>, T <dbl>

C. Check Data Tail

tail(complete_aa)
## # A tibble: 6 x 22
##   Class TotalAA      G      P      A      V     L       I       M       C
##   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <dbl>   <dbl>   <dbl>   <dbl>
## 1     6     103 0.0971 0.0194 0.107  0.0874 0.117 0.0194  0.0194  0.00971
## 2     6     156 0.0385 0.0385 0.0641 0.0577 0.122 0.0321  0.0256  0.0128 
## 3     6     147 0.0884 0.0340 0.0952 0.0952 0.116 0.0408  0.0204  0.0136 
## 4     6     143 0.0699 0.0490 0.105  0.0490 0.161 0.00699 0.0140  0.0140 
## 5     6     146 0.0959 0.0342 0.0959 0.123  0.130 0       0.00685 0.00685
## 6     6     141 0.106  0.0426 0.0638 0.0993 0.106 0.0142  0.0213  0.0213 
## # ... with 12 more variables: F <dbl>, Y <dbl>, W <dbl>, H <dbl>, K <dbl>,
## #   R <dbl>, Q <dbl>, N <dbl>, E <dbl>, D <dbl>, S <dbl>, T <dbl>

D. Find Dimensions:

The dataset consists of 22 variables and 16,952 observations, consisting of 7 classes, (0:6).

dim(complete_aa)
## [1] 16952    22

E. Examine Data Structure:

Note: Class is a numerical value and must be converted to a categorical before analysis.

str(complete_aa)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 16952 obs. of  22 variables:
##  $ Class  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TotalAA: num  393 2016 1210 2871 770 ...
##  $ G      : num  0.0585 0.058 0.0702 0.1069 0.0494 ...
##  $ P      : num  0.1145 0.0526 0.062 0.061 0.0455 ...
##  $ A      : num  0.0611 0.0675 0.0595 0.0334 0.0818 ...
##  $ V      : num  0.0458 0.059 0.0579 0.0376 0.0844 ...
##  $ L      : num  0.0814 0.1126 0.0917 0.0491 0.0727 ...
##  $ I      : num  0.0204 0.0605 0.057 0.0519 0.0312 ...
##  $ M      : num  0.0305 0.0342 0.0207 0.0181 0.0312 ...
##  $ C      : num  0.0254 0.0208 0.0496 0.1261 0.0234 ...
##  $ F      : num  0.028 0.0615 0.0298 0.0293 0.0273 ...
##  $ Y      : num  0.0229 0.0268 0.0298 0.0324 0.026 ...
##  $ W      : num  0.01018 0.01637 0.01074 0.00453 0.01169 ...
##  $ H      : num  0.0305 0.0134 0.0256 0.0167 0.0325 ...
##  $ K      : num  0.0509 0.0412 0.0545 0.0387 0.0532 ...
##  $ R      : num  0.0662 0.0516 0.0496 0.0456 0.0481 ...
##  $ Q      : num  0.0382 0.0352 0.0405 0.0348 0.0468 ...
##  $ N      : num  0.0356 0.0382 0.0545 0.0655 0.0403 ...
##  $ E      : num  0.0763 0.067 0.0636 0.07 0.1195 ...
##  $ D      : num  0.0509 0.0466 0.0504 0.0603 0.0649 ...
##  $ S      : num  0.0967 0.0784 0.0694 0.0603 0.0455 ...
##  $ T      : num  0.056 0.0585 0.0529 0.0578 0.0649 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Class = col_double(),
##   ..   TotalAA = col_double(),
##   ..   G = col_double(),
##   ..   P = col_double(),
##   ..   A = col_double(),
##   ..   V = col_double(),
##   ..   L = col_double(),
##   ..   I = col_double(),
##   ..   M = col_double(),
##   ..   C = col_double(),
##   ..   F = col_double(),
##   ..   Y = col_double(),
##   ..   W = col_double(),
##   ..   H = col_double(),
##   ..   K = col_double(),
##   ..   R = col_double(),
##   ..   Q = col_double(),
##   ..   N = col_double(),
##   ..   E = col_double(),
##   ..   D = col_double(),
##   ..   S = col_double(),
##   ..   T = col_double()
##   .. )

F. Convert Class To Factor

Convert Class(numerical) to Factor of 7 Protein Classes(Prot_Class)

prot_class <- as.factor(complete_aa$Class)

typeof(complete_aa$Class)
## [1] "double"
class(complete_aa$Class)
## [1] "numeric"

G. Produce Summary

Produce Summary information for all 22 columns. The summary command produces a 6 number summary.

summary(complete_aa$TotalAA)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     4.0   253.0   432.5   591.9   707.0 34350.0

H. Using Package “psych::describe” commands

Using ‘describe’ we are able to investigate the first four moments.

For brevity, we will calculate 2 amino acids (R, H)

Single Letter Amino Acid
R Arginine
H Histidine
describe(complete_aa$R)
##    vars     n mean   sd median trimmed  mad min  max range skew kurtosis
## X1    1 16952 0.06 0.02   0.05    0.06 0.02   0 0.47  0.47 1.72    12.73
##    se
## X1  0
describe(complete_aa$H)
##    vars     n mean   sd median trimmed  mad min max range skew kurtosis se
## X1    1 16952 0.03 0.01   0.02    0.02 0.01   0 0.3   0.3 2.02    14.93  0

I. Testing for Missing Data

In R, missing values are represented by ‘NA’ (not available). This is not to be confused with impossible values (e.g., dividing by zero) are represented by ‘NaN’ (not a number).

kable(sapply(complete_aa, function(x) sum(is.na(x))))
x
Class 0
TotalAA 0
G 0
P 0
A 0
V 0
L 0
I 0
M 0
C 0
F 0
Y 0
W 0
H 0
K 0
R 0
Q 0
N 0
E 0
D 0
S 0
T 0
There are NO missing values in any of the 22 features.

J. Principle Component Analysis: princomp

princomp has a component called sdev that is the “Standard deviation” for the dataframes features, therefore we can calculate variance.

aa_PCA = princomp(complete_aa[,3:22])

aa_PCA$sdev^2
##       Comp.1       Comp.2       Comp.3       Comp.4       Comp.5 
## 1.790777e-03 1.326783e-03 1.071601e-03 8.286978e-04 6.449875e-04 
##       Comp.6       Comp.7       Comp.8       Comp.9      Comp.10 
## 5.647001e-04 4.849827e-04 4.710319e-04 4.113380e-04 3.585041e-04 
##      Comp.11      Comp.12      Comp.13      Comp.14      Comp.15 
## 3.045709e-04 2.891094e-04 2.577679e-04 2.283155e-04 1.690747e-04 
##      Comp.16      Comp.17      Comp.18      Comp.19      Comp.20 
## 1.678894e-04 1.422886e-04 9.969130e-05 7.136116e-05 9.345848e-07

The proportion of variance is the variance divided by the sum of all variances.

The Cumulative Proportion is the cumulative sum of the proportion of variance.

cumsum(aa_PCA$sdev^2 / sum(aa_PCA$sdev^2))
##    Comp.1    Comp.2    Comp.3    Comp.4    Comp.5    Comp.6    Comp.7 
## 0.1849135 0.3219155 0.4325676 0.5181379 0.5847386 0.6430488 0.6931275 
##    Comp.8    Comp.9   Comp.10   Comp.11   Comp.12   Comp.13   Comp.14 
## 0.7417657 0.7842400 0.8212587 0.8527083 0.8825614 0.9091782 0.9327538 
##   Comp.15   Comp.16   Comp.17   Comp.18   Comp.19   Comp.20 
## 0.9502122 0.9675483 0.9822408 0.9925348 0.9999035 1.0000000

Now you have the Cumulative Proportion values, just plot them.

plot(cumsum(aa_PCA$sdev^2 / sum(aa_PCA$sdev^2)), 
     main = "Cumulative Proportion of varainces of all 20 Amino Acids(AA)",
     xlab = "Principal Component",
     ylab = "Proportion of AA",
     ylim = c(0,1),
     type="b")

The contribution of each amino acid is hard to discern from this graphic alone, therefore I will use another library to investigate further.


K. Machine Settings:

Sys.info()[c(1:3,5)]
##                                               sysname 
##                                               "Linux" 
##                                               release 
##                                   "4.15.0-46-generic" 
##                                               version 
## "#49~16.04.1-Ubuntu SMP Tue Feb 12 17:45:24 UTC 2019" 
##                                               machine 
##                                              "x86_64"
sessionInfo()
## R version 3.4.4 (2018-03-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Linux Mint 18.3
## 
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] knitr_1.21   psych_1.8.12 readr_1.3.1 
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.0       magrittr_1.5     hms_0.4.2        mnormt_1.5-5    
##  [5] lattice_0.20-38  R6_2.3.0         rlang_0.3.0.1    fansi_0.4.0     
##  [9] highr_0.7        stringr_1.3.1    tools_3.4.4      parallel_3.4.4  
## [13] grid_3.4.4       nlme_3.1-137     xfun_0.4         utf8_1.1.4      
## [17] cli_1.0.1        htmltools_0.3.6  assertthat_0.2.0 yaml_2.2.0      
## [21] digest_0.6.18    tibble_1.4.2     crayon_1.3.4     evaluate_0.12   
## [25] rmarkdown_1.11   stringi_1.2.4    compiler_3.4.4   pillar_1.3.1    
## [29] foreign_0.8-71   pkgconfig_2.0.2

EOF