One important question which I would like to answer with an exploratory data analysis is this data set sufficient to succesfully classify the seven classes of protein using a Random Forest machine learning approach? This should clearer as we test and further describe the data.
| Code Key | Protein Class |
|---|---|
| 0 / Ctrl | Controls consist of human proteins which do not bind oxygen |
| 1 / Ery | Erythrocruorin |
| 2 / Hcy | Hemocyanin |
| 3 / Hgb | Hemoglobin |
| 4 / Hhe | Hemerythrin |
| 5 / Lgb | Leghemoglobin |
| 6 / Mgb | Myoglobin |
Libraries:
knitr::opts_chunk$set(echo = TRUE)
Libraries = c("readr", "psych", "knitr")
# Install if not present
for(p in Libraries){
if(!require(p, character.only = TRUE))
install.packages(p)
library(p, character.only = TRUE)
}
Import Single Amino Acid percent composition for first round of Exploratory Data Analysis.
complete_aa <- read_csv("complete_aa.csv")
head(complete_aa)
## # A tibble: 6 x 22
## Class TotalAA G P A V L I M C
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 393 0.0585 0.115 0.0611 0.0458 0.0814 0.0204 0.0305 0.0254
## 2 0 2016 0.0580 0.0526 0.0675 0.0590 0.113 0.0605 0.0342 0.0208
## 3 0 1210 0.0702 0.0620 0.0595 0.0579 0.0917 0.0570 0.0207 0.0496
## 4 0 2871 0.107 0.0610 0.0334 0.0376 0.0491 0.0519 0.0181 0.126
## 5 0 770 0.0494 0.0455 0.0818 0.0844 0.0727 0.0312 0.0312 0.0234
## 6 0 781 0.0717 0.0448 0.101 0.0691 0.120 0.0423 0.0384 0.0141
## # ... with 12 more variables: F <dbl>, Y <dbl>, W <dbl>, H <dbl>, K <dbl>,
## # R <dbl>, Q <dbl>, N <dbl>, E <dbl>, D <dbl>, S <dbl>, T <dbl>
tail(complete_aa)
## # A tibble: 6 x 22
## Class TotalAA G P A V L I M C
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 6 103 0.0971 0.0194 0.107 0.0874 0.117 0.0194 0.0194 0.00971
## 2 6 156 0.0385 0.0385 0.0641 0.0577 0.122 0.0321 0.0256 0.0128
## 3 6 147 0.0884 0.0340 0.0952 0.0952 0.116 0.0408 0.0204 0.0136
## 4 6 143 0.0699 0.0490 0.105 0.0490 0.161 0.00699 0.0140 0.0140
## 5 6 146 0.0959 0.0342 0.0959 0.123 0.130 0 0.00685 0.00685
## 6 6 141 0.106 0.0426 0.0638 0.0993 0.106 0.0142 0.0213 0.0213
## # ... with 12 more variables: F <dbl>, Y <dbl>, W <dbl>, H <dbl>, K <dbl>,
## # R <dbl>, Q <dbl>, N <dbl>, E <dbl>, D <dbl>, S <dbl>, T <dbl>
The dataset consists of 22 variables and 16,952 observations, consisting of 7 classes, (0:6).
dim(complete_aa)
## [1] 16952 22
Note: Class is a numerical value and must be converted to a categorical before analysis.
str(complete_aa)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 16952 obs. of 22 variables:
## $ Class : num 0 0 0 0 0 0 0 0 0 0 ...
## $ TotalAA: num 393 2016 1210 2871 770 ...
## $ G : num 0.0585 0.058 0.0702 0.1069 0.0494 ...
## $ P : num 0.1145 0.0526 0.062 0.061 0.0455 ...
## $ A : num 0.0611 0.0675 0.0595 0.0334 0.0818 ...
## $ V : num 0.0458 0.059 0.0579 0.0376 0.0844 ...
## $ L : num 0.0814 0.1126 0.0917 0.0491 0.0727 ...
## $ I : num 0.0204 0.0605 0.057 0.0519 0.0312 ...
## $ M : num 0.0305 0.0342 0.0207 0.0181 0.0312 ...
## $ C : num 0.0254 0.0208 0.0496 0.1261 0.0234 ...
## $ F : num 0.028 0.0615 0.0298 0.0293 0.0273 ...
## $ Y : num 0.0229 0.0268 0.0298 0.0324 0.026 ...
## $ W : num 0.01018 0.01637 0.01074 0.00453 0.01169 ...
## $ H : num 0.0305 0.0134 0.0256 0.0167 0.0325 ...
## $ K : num 0.0509 0.0412 0.0545 0.0387 0.0532 ...
## $ R : num 0.0662 0.0516 0.0496 0.0456 0.0481 ...
## $ Q : num 0.0382 0.0352 0.0405 0.0348 0.0468 ...
## $ N : num 0.0356 0.0382 0.0545 0.0655 0.0403 ...
## $ E : num 0.0763 0.067 0.0636 0.07 0.1195 ...
## $ D : num 0.0509 0.0466 0.0504 0.0603 0.0649 ...
## $ S : num 0.0967 0.0784 0.0694 0.0603 0.0455 ...
## $ T : num 0.056 0.0585 0.0529 0.0578 0.0649 ...
## - attr(*, "spec")=
## .. cols(
## .. Class = col_double(),
## .. TotalAA = col_double(),
## .. G = col_double(),
## .. P = col_double(),
## .. A = col_double(),
## .. V = col_double(),
## .. L = col_double(),
## .. I = col_double(),
## .. M = col_double(),
## .. C = col_double(),
## .. F = col_double(),
## .. Y = col_double(),
## .. W = col_double(),
## .. H = col_double(),
## .. K = col_double(),
## .. R = col_double(),
## .. Q = col_double(),
## .. N = col_double(),
## .. E = col_double(),
## .. D = col_double(),
## .. S = col_double(),
## .. T = col_double()
## .. )
Convert Class(numerical) to Factor of 7 Protein Classes(Prot_Class)
prot_class <- as.factor(complete_aa$Class)
typeof(complete_aa$Class)
## [1] "double"
class(complete_aa$Class)
## [1] "numeric"
Produce Summary information for all 22 columns. The summary command produces a 6 number summary.
summary(complete_aa$TotalAA)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 253.0 432.5 591.9 707.0 34350.0
Using ‘describe’ we are able to investigate the first four moments.
For brevity, we will calculate 2 amino acids (R, H)
| Single Letter | Amino Acid |
|---|---|
| R | Arginine |
| H | Histidine |
describe(complete_aa$R)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 16952 0.06 0.02 0.05 0.06 0.02 0 0.47 0.47 1.72 12.73
## se
## X1 0
describe(complete_aa$H)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 16952 0.03 0.01 0.02 0.02 0.01 0 0.3 0.3 2.02 14.93 0
In R, missing values are represented by ‘NA’ (not available). This is not to be confused with impossible values (e.g., dividing by zero) are represented by ‘NaN’ (not a number).
kable(sapply(complete_aa, function(x) sum(is.na(x))))
| x | |
|---|---|
| Class | 0 |
| TotalAA | 0 |
| G | 0 |
| P | 0 |
| A | 0 |
| V | 0 |
| L | 0 |
| I | 0 |
| M | 0 |
| C | 0 |
| F | 0 |
| Y | 0 |
| W | 0 |
| H | 0 |
| K | 0 |
| R | 0 |
| Q | 0 |
| N | 0 |
| E | 0 |
| D | 0 |
| S | 0 |
| T | 0 |
| There are | NO missing values in any of the 22 features. |
princompprincomp has a component called sdev that is the “Standard deviation” for the dataframes features, therefore we can calculate variance.
aa_PCA = princomp(complete_aa[,3:22])
aa_PCA$sdev^2
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## 1.790777e-03 1.326783e-03 1.071601e-03 8.286978e-04 6.449875e-04
## Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
## 5.647001e-04 4.849827e-04 4.710319e-04 4.113380e-04 3.585041e-04
## Comp.11 Comp.12 Comp.13 Comp.14 Comp.15
## 3.045709e-04 2.891094e-04 2.577679e-04 2.283155e-04 1.690747e-04
## Comp.16 Comp.17 Comp.18 Comp.19 Comp.20
## 1.678894e-04 1.422886e-04 9.969130e-05 7.136116e-05 9.345848e-07
The proportion of variance is the variance divided by the sum of all variances.
The Cumulative Proportion is the cumulative sum of the proportion of variance.
cumsum(aa_PCA$sdev^2 / sum(aa_PCA$sdev^2))
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## 0.1849135 0.3219155 0.4325676 0.5181379 0.5847386 0.6430488 0.6931275
## Comp.8 Comp.9 Comp.10 Comp.11 Comp.12 Comp.13 Comp.14
## 0.7417657 0.7842400 0.8212587 0.8527083 0.8825614 0.9091782 0.9327538
## Comp.15 Comp.16 Comp.17 Comp.18 Comp.19 Comp.20
## 0.9502122 0.9675483 0.9822408 0.9925348 0.9999035 1.0000000
Now you have the Cumulative Proportion values, just plot them.
plot(cumsum(aa_PCA$sdev^2 / sum(aa_PCA$sdev^2)),
main = "Cumulative Proportion of varainces of all 20 Amino Acids(AA)",
xlab = "Principal Component",
ylab = "Proportion of AA",
ylim = c(0,1),
type="b")
The contribution of each amino acid is hard to discern from this graphic alone, therefore I will use another library to investigate further.
Sys.info()[c(1:3,5)]
## sysname
## "Linux"
## release
## "4.15.0-46-generic"
## version
## "#49~16.04.1-Ubuntu SMP Tue Feb 12 17:45:24 UTC 2019"
## machine
## "x86_64"
sessionInfo()
## R version 3.4.4 (2018-03-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Linux Mint 18.3
##
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] knitr_1.21 psych_1.8.12 readr_1.3.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.0 magrittr_1.5 hms_0.4.2 mnormt_1.5-5
## [5] lattice_0.20-38 R6_2.3.0 rlang_0.3.0.1 fansi_0.4.0
## [9] highr_0.7 stringr_1.3.1 tools_3.4.4 parallel_3.4.4
## [13] grid_3.4.4 nlme_3.1-137 xfun_0.4 utf8_1.1.4
## [17] cli_1.0.1 htmltools_0.3.6 assertthat_0.2.0 yaml_2.2.0
## [21] digest_0.6.18 tibble_1.4.2 crayon_1.3.4 evaluate_0.12
## [25] rmarkdown_1.11 stringi_1.2.4 compiler_3.4.4 pillar_1.3.1
## [29] foreign_0.8-71 pkgconfig_2.0.2
EOF