Introduction

The following script illustrates the relationships the 20 amino acids have with their own characteristics. These characteristics include their polarity, molecular weight, etc. These relationships have been displayed in a multitude of ways using the packages lattice, ggplot2, ggpubr, vegan, and permute. Such comparisons span from plotting highly correlated and independent characteristics in a scatterplot to using PCA for multi-dimensional analysis. The relationships amino acids have with their own characteristics are important biological phenomena as any possible correlations can help with aspects of the discipline such as drug discovery and protein interactions.

Preliminaries

Packages

The following script uses ggplot2 and ggpubr for many of its plotting. ggplot2 and ggpubr both share a wrapper relationship in that they both act as functions that can run other functions for the user. We use these packages to highlight relationships between amino acid variables in higher dimensional or standard 2d models.

# installing packages. only need to be done once.
#install.packages("permute")
#install.packages("lattice")
#install.packages("ggplot2")
#install.packages("ggpubr")
#install.packages("vegan")
#install.packages("colorspace")
# downloading packages 
library(permute)
library(lattice)
library(ggplot2)
library(ggpubr)
library(vegan)
## This is vegan 2.5-6
library(colorspace)
library(scatterplot3d)

Building the dataframe

The following creates vectors that will each act as variables to describe the characteristics of each amino acid in a dataframe dubbed ‘aa_dat.’ These data are derived mostly from Higgs (2009) Table 1. They are similar to Higgs and Attwood 2005 but have extra columns. Nathan Brouwer added additional categorical varibles based on online information.

# 1 letter code
aa <-c('A','C','D','E','F','G','H','I','K','L','M','N',
'P','Q','R','S','T','V','W','Y')

## molecular weight in dalton
MW.da <-c(89,121,133,146,165,75,155,131,146,131,149,132,115,147,174,105,119,117,204,181)

## vol from van der Waals radii
vol <-c(67,86,91,109,135,48,118,124,135,124,124,96,90,
114,148,73,93,105,163,141)

## bulk – a measure of the shape of the side chain
bulk <-c(11.5,13.46,11.68,13.57,19.8,3.4,13.69,21.4,
15.71,21.4,16.25,12.28,17.43,
14.45,14.28,9.47,15.77,21.57,21.67,18.03)

## pol – a measure of the electric field strength around the molecule
pol <-c(0,1.48,49.7,49.9,0.35,0,51.6,0.13,49.5,0.13,
1.43,3.38,1.58,3.53,52,1.67,1.66,0.13,2.1,1.61)

## isoelec point
isoelec <-c(6,5.07,2.77,3.22,5.48,5.97,7.59,6.02,9.74,5.98,
5.74,5.41,6.3,5.65,10.76,5.68,6.16,5.96,5.89,5.66)

## 1st Hydrophobicity scale
H2Opho.34 <-c(1.8,2.5,-3.5,-3.5,2.8,-0.4,-3.2,4.5,-3.9,3.8,1.9,
-3.5,-1.6,-3.5,-4.5,-0.8,-0.7,4.2,-0.9,-1.3)

## 2nd Hydrophobicity scale
H2Opho.35 <-c(1.6,2,-9.2,-8.2,3.7,1,-3,3.1,-8.8,2.8,3.4,-4.8,
-0.2,-4.1,-12.3,0.6,1.2,2.6,1.9,-0.7)

## Surface area accessible to water in an unfolded peptide
saaH2O <-c(113,140,151,183,218,85,194,182,211,180,204,158,
143,189,241,122,146,160,259,229)

## Fraction of accessible area lost when a protein folds
faal.fold <-c(0.74,0.91,0.62,0.62,0.88,0.72,0.78,0.88,0.52,
0.85,0.85,0.63,0.64,0.62,0.64,0.66,0.7,0.86,0.85,0.76)

# Polar requirement
polar.req <-c(7,4.8,13,12.5,5,7.9,8.4,4.9,10.1,4.9,5.3,10,
6.6,8.6,9.1,7.5,6.6,5.6,5.2,5.4)

## relative frequency of occurance
## "The frequencies column shows the mean
### percentage of each amino acid in the protein sequences
### of modern organisms"
freq <-c(7.8,1.1,5.19,6.72,4.39,6.77,2.03,6.95,6.32,
10.15,2.28,4.37,4.26,3.45,5.23,6.46,5.12,7.01,1.09,3.3)

# charges
## un = Un-charged
## neg = negative
## pos = positive
charge<-c('un','un','neg','neg','un','un','pos','un','pos',
          'un','un','un','un','un','pos','un','un','un','un','un')

## hydropathy
hydropathy<-c('hydrophobic','hydrophobic','hydrophilic','
hydrophilic','hydrophobic','neutral','neutral',
'hydrophobic','hydrophilic','hydrophobic','hydrophobic',
'hydrophilic','neutral',
'hydrophilic','hydrophilic','neutral','neutral',
'hydrophobic','hydrophobic','neutral')

## vol
vol.cat<-c('verysmall','small','small','medium',
'verylarge','verysmall','medium','large','large',
'large','large','small','small','medium','large',
'verysmall',
'small','medium','verylarge','verylarge')

## pol
pol.cat<-c('nonpolar','nonpolar','polar','polar',
'nonpolar','nonpolar','polar','nonpolar',
'polar','nonpolar','nonpolar','polar','nonpolar','polar',
'polar','polar','polar','nonpolar','nonpolar','polar')
## chemical

chemical<-c('aliphatic','sulfur','acidic','acidic','aromatic',
'aliphatic','basic','aliphatic','basic','aliphatic','sulfur',
'amide','aliphatic','amide','basic','hydroxyl','hydroxyl',
'aliphatic','aromatic','aromatic')


#creating dataframe from the above datasets.
aa_dat <- data.frame(aa=aa,
                     MW.da=MW.da,
                     vol=vol,
                     bulk=bulk,
                     pol.quant=pol,
                     isoelec=isoelec,
                     H2Opho.34=H2Opho.34,
                     H2O.pho.35=H2Opho.35,
                     saaH2O=saaH2O,
                     faal.fold=faal.fold,
                     polar.req=polar.req,
                     freq=freq,
                     charge=charge,
                     hydropathy=hydropathy,
                     vol.cat=vol.cat,
                     pol.cat=pol.cat,
                     chemical=chemical
                     )
aa_dat
##    aa MW.da vol  bulk pol.quant isoelec H2Opho.34 H2O.pho.35 saaH2O faal.fold
## 1   A    89  67 11.50      0.00    6.00       1.8        1.6    113      0.74
## 2   C   121  86 13.46      1.48    5.07       2.5        2.0    140      0.91
## 3   D   133  91 11.68     49.70    2.77      -3.5       -9.2    151      0.62
## 4   E   146 109 13.57     49.90    3.22      -3.5       -8.2    183      0.62
## 5   F   165 135 19.80      0.35    5.48       2.8        3.7    218      0.88
## 6   G    75  48  3.40      0.00    5.97      -0.4        1.0     85      0.72
## 7   H   155 118 13.69     51.60    7.59      -3.2       -3.0    194      0.78
## 8   I   131 124 21.40      0.13    6.02       4.5        3.1    182      0.88
## 9   K   146 135 15.71     49.50    9.74      -3.9       -8.8    211      0.52
## 10  L   131 124 21.40      0.13    5.98       3.8        2.8    180      0.85
## 11  M   149 124 16.25      1.43    5.74       1.9        3.4    204      0.85
## 12  N   132  96 12.28      3.38    5.41      -3.5       -4.8    158      0.63
## 13  P   115  90 17.43      1.58    6.30      -1.6       -0.2    143      0.64
## 14  Q   147 114 14.45      3.53    5.65      -3.5       -4.1    189      0.62
## 15  R   174 148 14.28     52.00   10.76      -4.5      -12.3    241      0.64
## 16  S   105  73  9.47      1.67    5.68      -0.8        0.6    122      0.66
## 17  T   119  93 15.77      1.66    6.16      -0.7        1.2    146      0.70
## 18  V   117 105 21.57      0.13    5.96       4.2        2.6    160      0.86
## 19  W   204 163 21.67      2.10    5.89      -0.9        1.9    259      0.85
## 20  Y   181 141 18.03      1.61    5.66      -1.3       -0.7    229      0.76
##    polar.req  freq charge    hydropathy   vol.cat  pol.cat  chemical
## 1        7.0  7.80     un   hydrophobic verysmall nonpolar aliphatic
## 2        4.8  1.10     un   hydrophobic     small nonpolar    sulfur
## 3       13.0  5.19    neg   hydrophilic     small    polar    acidic
## 4       12.5  6.72    neg \nhydrophilic    medium    polar    acidic
## 5        5.0  4.39     un   hydrophobic verylarge nonpolar  aromatic
## 6        7.9  6.77     un       neutral verysmall nonpolar aliphatic
## 7        8.4  2.03    pos       neutral    medium    polar     basic
## 8        4.9  6.95     un   hydrophobic     large nonpolar aliphatic
## 9       10.1  6.32    pos   hydrophilic     large    polar     basic
## 10       4.9 10.15     un   hydrophobic     large nonpolar aliphatic
## 11       5.3  2.28     un   hydrophobic     large nonpolar    sulfur
## 12      10.0  4.37     un   hydrophilic     small    polar     amide
## 13       6.6  4.26     un       neutral     small nonpolar aliphatic
## 14       8.6  3.45     un   hydrophilic    medium    polar     amide
## 15       9.1  5.23    pos   hydrophilic     large    polar     basic
## 16       7.5  6.46     un       neutral verysmall    polar  hydroxyl
## 17       6.6  5.12     un       neutral     small    polar  hydroxyl
## 18       5.6  7.01     un   hydrophobic    medium nonpolar aliphatic
## 19       5.2  1.09     un   hydrophobic verylarge nonpolar  aromatic
## 20       5.4  3.30     un       neutral verylarge    polar  aromatic

Raw Data Exploration

Correlation Matrix

The following displays a correlation matrix. The columns of the data set are set against each other and the two value’s calculated correlation coefficient is plotted.

cor_ <- round(cor(aa_dat[,-c(1,13:17)]),2)
diag(cor_) <- NA
cor_[upper.tri(cor_)] <- NA
cor_
##            MW.da   vol  bulk pol.quant isoelec H2Opho.34 H2O.pho.35 saaH2O
## MW.da         NA    NA    NA        NA      NA        NA         NA     NA
## vol         0.93    NA    NA        NA      NA        NA         NA     NA
## bulk        0.55  0.73    NA        NA      NA        NA         NA     NA
## pol.quant   0.29  0.24 -0.20        NA      NA        NA         NA     NA
## isoelec     0.20  0.36  0.08      0.27      NA        NA         NA     NA
## H2Opho.34  -0.27 -0.08  0.44     -0.67   -0.20        NA         NA     NA
## H2O.pho.35 -0.25 -0.16  0.32     -0.85   -0.26      0.85         NA     NA
## saaH2O      0.97  0.99  0.64      0.29    0.35     -0.18      -0.23     NA
## faal.fold   0.11  0.18  0.49     -0.53   -0.18      0.84       0.79   0.12
## polar.req  -0.05 -0.19 -0.53      0.76   -0.11     -0.79      -0.87  -0.11
## freq       -0.52 -0.30 -0.04     -0.01    0.02      0.26      -0.02  -0.38
##            faal.fold polar.req freq
## MW.da             NA        NA   NA
## vol               NA        NA   NA
## bulk              NA        NA   NA
## pol.quant         NA        NA   NA
## isoelec           NA        NA   NA
## H2Opho.34         NA        NA   NA
## H2O.pho.35        NA        NA   NA
## saaH2O            NA        NA   NA
## faal.fold         NA        NA   NA
## polar.req      -0.81        NA   NA
## freq           -0.18      0.14   NA

The following code displays the coordinates of the cell holding the maximum and minimum correlation coefficient respectively. In other words, this tells us what two variables are the most and least correlated with each other among the 20 amino acids. From this, we see saaH2O and vol have the highest correlation of 0.99 and polar.req and H2Opho.35 have the smallest correlation of -0.87.

#getting the maxmimum correlation
which(cor_ == max(cor_, na.rm = T), arr.ind = T)
##        row col
## saaH2O   8   2
# getting the minimum correlation 
which(cor_ == min(cor_, na.rm = T), arr.ind = T)
##           row col
## polar.req  10   7

Plot 0: Scatterplot matrix

panel.cor is a function to help plot a correlation matrix for the newly created dataframe. For typical matrices, only half of the matrix is needed as the other half mirrors itself. panel.cor makes it so the matrix will show each cell’s corresponding correlation coefficient to eliminate any superfluous data.

panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor,...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}

The scatterplot matrix below shows the relationships molecular weight, volume, polarity, hydrophobicity, H2O surface area, accessible area lost during folding, and polar requirement among the amino acids have with each other. From this matrix, we see that the relationship MW.da-vol, H2Opho.34-pol.quant, saaH2O-vol, saaH2O-MW.da, faal.fold-H2Opho.34, polar.req-faal.fold, polar.req-H2Opho.34 and polar.req-pol.quant relationships all have strong correlations, while every other relationship besides the faal.fold-pol.quant relationship are all weakly correlated. From the smoothers added to each scatterplot, we see the H2Opho.34-saaH2O, H2Opho.34-polar.req, MW.da-pol.quant and MW.da-faal.fold all have very notable non-linear relationships.

plot(aa_dat[-c(1,4,6,8,12:17)], upper.panel = panel.cor,
     panel = panel.smooth)

Plot 1: Replication Higgs and Attwood Figure 2.8

The plot displays each amino acid’s polar requirement and frequency relationship. We see a very weak correlation between the two.

#plotting each amino acid's polar requirement and frquency together using plot()
#xlab represents the x axis label
#ylab represents the y axis label
#main represents the title of the plot

#plot() creates an empty plot, adds the labels for the x and y axis and labels the title
#text() adds points to the empty plot. In this case, text() adds points for each amino acid's polar.req and freq relationship.

plot(polar.req  ~ freq, data = aa_dat,
     xlab = "pI",
     ylab = "Volume",
     main = "Volume vs polar requirement",
     col = 0)
text(polar.req ~ freq, 
     labels = aa, 
     data = aa_dat, 
     col = 1:20)

Figure 2: HA Figure 2.8 with ggpubr

The plot below displays a scatter plot matrix plotting frequency and polar requirement of the 20 amino acids. Additional variables are illustrated: each point has varying sizes depending on its volume and have a color from a gradient depending on its molecular weight.

ggscatter(y = "polar.req",
x = "freq",
size = "vol",
color = "MW.da",
data = aa_dat,
xlab = "frequency",
ylab = "polar requirement")

Figure 3: Highly correlated data

By plotting amino acid volume and surface area exposed to water of an unfolded peptide, we see that the columns are highly correlated. This is additionally visualized from the regression line and data ellipse.

ggscatter(y = "saaH2O",
          x = "vol",
          size = "saaH2O",
          data = aa_dat,
          xlab = "saaH2O",
          ylab = "volume", 
          add="reg.line", #adding regression line 
          cor.coef=TRUE,  #adding correlation coefficient
          ellipse = TRUE  #adding ellipse 
         )
## `geom_smooth()` using formula 'y ~ x'

The general formula to regression lines is y=b0+b1x. Where b0 is the y intercept and b1 is the slope of the regression line. Data ellipses, such as the one below, show for both the x and y axis the deviation of values among that respective axis.

ggscatter(y = "polar.req",
          x = "freq",
          size = "polar.req",
          add = "reg.line", # line of best fit
          ellipse = TRUE, # data ellipse
          cor.coef = TRUE, # correlation coef
          # color = "freq",
          data = aa_dat,
          xlab = "Frequency",
          ylab = "Polar Requirement")
## `geom_smooth()` using formula 'y ~ x'

Figure 4: Non-linear relationship

When plotting the polarity and molecular weight among the 20 among amino acids, we see a non-linear relationship.

ggscatter(y = "pol.quant",
          x = "MW.da",
          size = "bulk",
          add = "loess", # adding smoother
          color = "freq",
          data = aa_dat,
          xlab = "Molecular Weight",
          ylab = "Polarity")
## `geom_smooth()` using formula 'y ~ x'

Figure 5: Non-linear relationship on LOG scale

Allometry is the concept in biology that morphological, metabolic, and other such characteristics of living thing are proportional (or correlated) to its size. As such, by taking the log of the two biological variables plotted against each other, we correct this proportionality to a scatter plot that is easier to read and study.

# making a separate dataframe from aa_dat to hold the logged value 
aa_dat_log <- aa_dat
aa_dat_log$pol.quant_log <- log(aa_dat$pol.quant)
aa_dat_log$MW.da_log <- log(aa_dat$MW.da)

ggscatter(y = "pol.quant_log",
          x = "MW.da_log",
          size = "bulk",
          add = "reg.line", # adding smoother
          color = "freq",
          data = aa_dat_log,
          xlab = "Molecular Weight",
          ylab = "Polarity")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (stat_smooth).

Figure 6: 3D Scatterplot

In the following 3D scatterplot, we see that the three variables plotted, molecular weight, volume, and surface area accessible to water of an unfolded peptide, are positively correlated. In other words, we see that when one of the variables increase, so do the other two for all three variables. We also see that the inverse, converse, and contrapositive is true.

# converting aa_dat into a data frame
aa_dat_df <-data.frame(aa_dat)

# plotting MW.da, vol, and saaH2O into 3d scatter plot
scatterplot3d(x=aa_dat_df$MW.da,
              y=aa_dat_df$vol,
              z=aa_dat_df$saaH2O,
              highlight.3d=TRUE,
              type="h")

Principal component analysis

Principal component analysis, PCA, is a dimension reduction method that displays the relationship of three or more variables in a 2d cartesian plane. This process of which this happens includes the clustering of data points by rotating the higher dimensional plot in space in way that “if we spot patterns in the data in the PC plot, such as clusters of closely spaced points, then these are likely to give a true impression of the patterns in the full data”; therefore, giving us the 2d representation of our data while conserving the variables’ relationships (Higgs and Attwood 2005, pg 29). There are various ways to produce PCA plots. Two ways of performing PCA is using base R and vegan. Base R uses default R functions to plot the PCA such as prcomp() and bigplot(). Alternatively, vegan comes from a package which uses rda() for perfoming PCA that has additional features.

Data Preparation

In preparation for performing PCA on this dataset, a new dataset was produced that only included the numeric variables. This was done first by finding the indexes of the variables that were not numeric and reverse concatenating them from the aa_dat data set into the new aa_dat2 dataset thus removing them.

aa_dat2 <- aa_dat[,-c(1,13:length(aa_dat))]

aa_dat2
##    MW.da vol  bulk pol.quant isoelec H2Opho.34 H2O.pho.35 saaH2O faal.fold
## 1     89  67 11.50      0.00    6.00       1.8        1.6    113      0.74
## 2    121  86 13.46      1.48    5.07       2.5        2.0    140      0.91
## 3    133  91 11.68     49.70    2.77      -3.5       -9.2    151      0.62
## 4    146 109 13.57     49.90    3.22      -3.5       -8.2    183      0.62
## 5    165 135 19.80      0.35    5.48       2.8        3.7    218      0.88
## 6     75  48  3.40      0.00    5.97      -0.4        1.0     85      0.72
## 7    155 118 13.69     51.60    7.59      -3.2       -3.0    194      0.78
## 8    131 124 21.40      0.13    6.02       4.5        3.1    182      0.88
## 9    146 135 15.71     49.50    9.74      -3.9       -8.8    211      0.52
## 10   131 124 21.40      0.13    5.98       3.8        2.8    180      0.85
## 11   149 124 16.25      1.43    5.74       1.9        3.4    204      0.85
## 12   132  96 12.28      3.38    5.41      -3.5       -4.8    158      0.63
## 13   115  90 17.43      1.58    6.30      -1.6       -0.2    143      0.64
## 14   147 114 14.45      3.53    5.65      -3.5       -4.1    189      0.62
## 15   174 148 14.28     52.00   10.76      -4.5      -12.3    241      0.64
## 16   105  73  9.47      1.67    5.68      -0.8        0.6    122      0.66
## 17   119  93 15.77      1.66    6.16      -0.7        1.2    146      0.70
## 18   117 105 21.57      0.13    5.96       4.2        2.6    160      0.86
## 19   204 163 21.67      2.10    5.89      -0.9        1.9    259      0.85
## 20   181 141 18.03      1.61    5.66      -1.3       -0.7    229      0.76
##    polar.req  freq
## 1        7.0  7.80
## 2        4.8  1.10
## 3       13.0  5.19
## 4       12.5  6.72
## 5        5.0  4.39
## 6        7.9  6.77
## 7        8.4  2.03
## 8        4.9  6.95
## 9       10.1  6.32
## 10       4.9 10.15
## 11       5.3  2.28
## 12      10.0  4.37
## 13       6.6  4.26
## 14       8.6  3.45
## 15       9.1  5.23
## 16       7.5  6.46
## 17       6.6  5.12
## 18       5.6  7.01
## 19       5.2  1.09
## 20       5.4  3.30

Figure 7: Principal components analysis - base R

#turning the new numeric dataset into a dataframe
aa_dat2 <- data.frame(aa_dat2)

#running the PCA then plotting it with bigplot()
pca.out <- prcomp(aa_dat2, scale = TRUE)

biplot(pca.out)

Figure 8: Principal components analysis - vegan

The argument scale is set to TRUE. This argument scales the axis to make it more readable with the use of the unit variance. The following PCA plot sets its group variable to charge. In comparison to Figure 2.10, page 28, of Higgs and Atwood, we see there are more defined clusters in this plot than Higgs and Atwood’s. The spreads on the first component is a lot greater in the Higgs and Atwood plot, but the spread of the second component between the two are similar. This plot also visually clusters the data points together unlike the Higgs and Atoowd PCA.

rda.out <- vegan::rda(aa_dat2, scale = TRUE)

rda_scores <- scores(rda.out)


biplot(rda.out, display = "sites", col = 0,)
orditorp(rda.out, display = "sites", labels = aa_dat$aa, cex = 1.2)
ordihull(rda.out,
group = aa_dat$charge,
col = 1:7,
lty = 1:7,
lwd = c(3,6))

Cluster analysis

The purpose of cluster analysis is to find patterns among a large set of data by categorizing them into clusters. Cluster analysis typically deals with raw data so any cluster analysis algorithm does not know before-hand what rows of data are related or not. UPGMA uses a form of cluster analysis called hierarchical cluster analysis. The general way it works is first a distance matrix is created from the raw dataset and the rows are grouped by averaging the distances of each row. Finally, the branch lengths of each data point is then calculated and the distances are recalculated accordingly.

Hierarchical cluster analysis

Euclidean distance a form of distance that follows the basic pythagorean theorem to calculate the distance between two points. The typical paradigm is that between two points, a right triangle is formed where the square root of the sum of the squares of the two legs of the triangle is the euclidean distance.

This dendrogram clusters amino acids based off their characteristics while the Higgs and Atwood’s dendrogram also specifies what the amino acids are clustered by:basic, acid + amide, small, hydrophobic, and aromatic. Compared to Plate 2.2 of Higgs and Atwood, we see a very similar branching formation but with different leaves of the tree. Higgs and Atwood’s version also includes a heat map while this version does not.

dist_euc <- dist(x=aa_dat2, method="euclidean")

aa_labels <- aa_dat$aa

## method="average" sets the function to do UPGMA
clust_euc<-hclust(d=dist_euc, method="average") 

plot(clust_euc, hang = -1, cex = 0.5, labels=aa_labels)