Introduction

We attempt to replicate Higgs and Attwood’s analysis on various properties of amino acids on a different set of data (Higgs 2009). We build data frames based on the data, and from there visualize our data in different ways in order to find patterns and relationships. We start with making a correlation matrix which allows us to see the strongest correlated and weakest correlated variables. We then visualize the variables that appear to have a linear relationship and ones that don’t to examine their relationship more closely. Finally, we conduct principal component analysis and cluster analysis on the amino acid data.

Learning how different properties of amino acids are related is biologically important because it is thought to be important for determining how the resulting protein might fold, allowing us to predict protein structure based on amino acid sequences and etcetra.

Packages

## Loading required package: permute
library(permute)
## Loading required package: Lattice
library(lattice)
## This is vegan 2.5-6
library(vegan)
## This is vegan 2.5-6

Build the Dataframe

These data are derived mostly from Higgs (2009) Table 1. They are similar to Higgs and Attwood 2005 but have extra columns. Nathan Brouwer added additional categorical varibles based on online information. Make the vectors. Data represents different properties of the 20 amino acids.

# 1 letter code
aa <-c('A','C','D','E','F','G','H','I','K','L','M','N',
'P','Q','R','S','T','V','W','Y')

## molecular weight in dalton
MW.da <-c(89,121,133,146,165,75,155,131,146,131,149,132,115,147,174,105,119,117,204,181)

## vol from van der Waals radii
vol <-c(67,86,91,109,135,48,118,124,135,124,124,96,90,
114,148,73,93,105,163,141)

## bulk – a measure of the shape of the side chain
bulk <-c(11.5,13.46,11.68,13.57,19.8,3.4,13.69,21.4,
15.71,21.4,16.25,12.28,17.43,
14.45,14.28,9.47,15.77,21.57,21.67,18.03)

## pol – a measure of the electric field strength around the molecule
pol <-c(0,1.48,49.7,49.9,0.35,0,51.6,0.13,49.5,0.13,
1.43,3.38,1.58,3.53,52,1.67,1.66,0.13,2.1,1.61)

## isoelec point
isoelec <-c(6,5.07,2.77,3.22,5.48,5.97,7.59,6.02,9.74,5.98,
5.74,5.41,6.3,5.65,10.76,5.68,6.16,5.96,5.89,5.66)

## 1st Hydrophobicity scale
H2Opho.34 <-c(1.8,2.5,-3.5,-3.5,2.8,-0.4,-3.2,4.5,-3.9,3.8,1.9,
-3.5,-1.6,-3.5,-4.5,-0.8,-0.7,4.2,-0.9,-1.3)

## 2nd Hydrophobicity scale
H2Opho.35 <-c(1.6,2,-9.2,-8.2,3.7,1,-3,3.1,-8.8,2.8,3.4,-4.8,
-0.2,-4.1,-12.3,0.6,1.2,2.6,1.9,-0.7)

## Surface area accessible to water in an unfolded peptide
saaH2O <-c(113,140,151,183,218,85,194,182,211,180,204,158,
143,189,241,122,146,160,259,229)

## Fraction of accessible area lost when a protein folds
faal.fold <-c(0.74,0.91,0.62,0.62,0.88,0.72,0.78,0.88,0.52,
0.85,0.85,0.63,0.64,0.62,0.64,0.66,0.7,0.86,0.85,0.76)

# Polar requirement
polar.req <-c(7,4.8,13,12.5,5,7.9,8.4,4.9,10.1,4.9,5.3,10,
6.6,8.6,9.1,7.5,6.6,5.6,5.2,5.4)

## relative frequency of occurance
## "The frequencies column shows the mean
### percentage of each amino acid in the protein sequences
### of modern organisms"
freq <-c(7.8,1.1,5.19,6.72,4.39,6.77,2.03,6.95,6.32,
10.15,2.28,4.37,4.26,3.45,5.23,6.46,5.12,7.01,1.09,3.3)

##
# charges
## un = Un-charged
## neg = negative
## pos = positive
charge<-c('un','un','neg','neg','un','un','pos','un','pos',
'un','un','un','un','un','pos','un','un','un','un','un')

## hydropathy
hydropathy<-c('hydrophobic','hydrophobic','hydrophilic','
hydrophilic','hydrophobic','neutral','neutral',
'hydrophobic','hydrophilic','hydrophobic','hydrophobic',
'hydrophilic','neutral',
'hydrophilic','hydrophilic','neutral','neutral',
'hydrophobic','hydrophobic','neutral')

## vol
vol.cat<-c('verysmall','small','small','medium',
'verylarge','verysmall','medium','large','large',
'large','large','small','small','medium','large',
'verysmall',
'small','medium','verylarge','verylarge')

## pol
pol.cat<-c('nonpolar','nonpolar','polar','polar',
'nonpolar','nonpolar','polar','nonpolar',
'polar','nonpolar','nonpolar','polar','nonpolar','polar',
'polar','polar','polar','nonpolar','nonpolar','polar')

## chemical
chemical<-c('aliphatic','sulfur','acidic','acidic','aromatic',
'aliphatic','basic','aliphatic','basic','aliphatic','sulfur',
'amide','aliphatic','amide','basic','hydroxyl','hydroxyl',
'aliphatic','aromatic','aromatic')

Build the data frames.

aa_dat <- data.frame(aa,MW.da,vol,bulk,pol,isoelec, H2Opho.34,
                     H2Opho.35, saaH2O, faal.fold, polar.req, freq,
                     charge, hydropathy, vol.cat, pol.cat, chemical)

Raw data exploration

Correlation Matrix

cor_ <- round(cor(aa_dat[,-c(1,13:17)]),2)
diag(cor_) <- NA
cor_[upper.tri(cor_)] <- NA
cor_
##           MW.da   vol  bulk   pol isoelec H2Opho.34 H2Opho.35 saaH2O faal.fold
## MW.da        NA    NA    NA    NA      NA        NA        NA     NA        NA
## vol        0.93    NA    NA    NA      NA        NA        NA     NA        NA
## bulk       0.55  0.73    NA    NA      NA        NA        NA     NA        NA
## pol        0.29  0.24 -0.20    NA      NA        NA        NA     NA        NA
## isoelec    0.20  0.36  0.08  0.27      NA        NA        NA     NA        NA
## H2Opho.34 -0.27 -0.08  0.44 -0.67   -0.20        NA        NA     NA        NA
## H2Opho.35 -0.25 -0.16  0.32 -0.85   -0.26      0.85        NA     NA        NA
## saaH2O     0.97  0.99  0.64  0.29    0.35     -0.18     -0.23     NA        NA
## faal.fold  0.11  0.18  0.49 -0.53   -0.18      0.84      0.79   0.12        NA
## polar.req -0.05 -0.19 -0.53  0.76   -0.11     -0.79     -0.87  -0.11     -0.81
## freq      -0.52 -0.30 -0.04 -0.01    0.02      0.26     -0.02  -0.38     -0.18
##           polar.req freq
## MW.da            NA   NA
## vol              NA   NA
## bulk             NA   NA
## pol              NA   NA
## isoelec          NA   NA
## H2Opho.34        NA   NA
## H2Opho.35        NA   NA
## saaH2O           NA   NA
## faal.fold        NA   NA
## polar.req        NA   NA
## freq           0.14   NA

Locate the indices where the maximum correlation (most positive) is found (not considering missing values).

which(cor_ == max(cor_, na.rm = T), arr.ind = T)
##        row col
## saaH2O   8   2
cor_[which(cor_ == max(cor_, na.rm = T), arr.ind = T)]
## [1] 0.99

We found that the strongest positive correlation is between saaH2O and volume, with a correlation of 0.99 at row 8 column 2. Locate the indices where the minimum correlation (most negative) is found (not considering missing values)

which(cor_ == min(cor_, na.rm = T), arr.ind = T)
##           row col
## polar.req  10   7
cor_[which(cor_ == min(cor_, na.rm = T), arr.ind = T)]
## [1] -0.87

The strongest negative correlation is between polar requirement and 2nd hydrophobicity scale with a correlation of -0.87 at row 10 column 7.

Figure 0: Scatterplot Matrix

A function is created to help format the scatterplot matrix. Takes input of two variables that would be from the data frame we previous created.

panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(0, 1, 0, 1))
    r <- abs(cor(x, y))
    txt <- format(c(r, 0.123456789), digits = digits)[1]
    txt <- paste0(prefix, txt)
    if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
    text(0.5, 0.5, txt, cex = cex.cor * r)
}
# building a sub-dataframe from the main data, including only the variables that we are considering
aa_matrix <- data.frame(MW.da, vol, pol, H2Opho.34, polar.req, saaH2O, faal.fold)
# plotting
plot(aa_matrix,upper.panel = panel.cor,
     panel = panel.smooth,
     main = "Scatterplot Matrix")

From the scatterplot matrix, we could see that these pairs of variables shows strong positive correlation: molecular weight and volume, molecular weight and surface area accessible by water, and volume and surface area accessible by water. All these pairs are relatively expected, as the larger an amino acid is, the more atoms it is likely to contain and the heavier its molecular weight would be. There also exists a ratio between the volume and surface area that is essentially linear positive in the scope we consider. On the other hand, surface area accessible by water seems to share non-linear relationship all other variables excluding MW and volume.

Figure 1: Replication Higgs and Attwood Figure 2.8

plot(vol ~ isoelec, #examining the relationship between these two variables
     data = aa_dat, #the respective columns comes from our dataframe
     xlab = "pI", #adds x label
     ylab = "Volume(Å3)", #adds y label
     main = "Amino acide volume vs. pI", #adds title
     col = 0)

#adding text to a plot
text(vol ~ isoelec,
     labels = aa, #using amino acid single letter abbreviations as label
     data = aa_dat,
     col = 1:20 #20 colors to be used for different texts
    )

The plot visualizes the relationship between amino acid volume against isoelectric point through all 20 amino acids.

Figure 2: HA Figure 2.8 with ggpubr

ggpubr::ggscatter(y = "polar.req",
          x = "freq",
          size = "polar.req",
          color = "polar.req",
          data = aa_dat,
          xlab = "Frequency",
          ylab = "Polar Requirement")

ggpubr::ggscatter(x = "faal.fold",
          y = "pol",
          size = "pol",
          shape = 18,
          color = "pol",
          data = aa_dat,
          ylab = "Polarity",
          xlab = "Fraction of accessible area lost upon fold")

Figure 3: Highly correlated data

ggpubr::ggscatter(x = "H2Opho.34",
                  y = "faal.fold",
                  data = aa_dat,
                  add = "reg.line", #adds a regression line
                  ellipse = T, #adds the ellipse
                  cor.coef = T, #adds the correlation coefficient
                  xlab = "1st Hydrophobicity scale",
                  ylab = "Fraction of lost area after folding")
## `geom_smooth()` using formula 'y ~ x'

The regression line indicates that the greater first hybrophicity scale value is, there’s likely a larger fraction of lost area after folding. The ellipse indicates a bivariate mean with roughly faal.fold = 0.75 and H2Opho.34 = 0. In addition, the covariance of these two variables seems to be relatively small as the ellipse is “narrow” around the regression line.

Figure 4: Non-linear relationship

ggpubr::ggscatter(x = "saaH2O",
                  y = "polar.req",
                  data = aa_dat,
                  add = "loess", #adds a loess smoother
                  size = "pol",
                  color = "faal.fold",
                  xlab = "Surface area accessible by water",
                  ylab = "Polar requirement")
## `geom_smooth()` using formula 'y ~ x'

Figure 5: Non-linear relationship on LOG scale

aa_dat$saaH2O_log <- log(aa_dat$saaH2O)
aa_dat$polar.req_log <- log(aa_dat$polar.req)
ggpubr::ggscatter(x = "saaH2O_log",
                  y = "polar.req_log",
                  data = aa_dat,
                  add = "reg.line", #adds a regression line
                  size = "pol",
                  color = "faal.fold",
                  xlab = "Log surface area accessible by water",
                  ylab = "Log polar requirement")
## `geom_smooth()` using formula 'y ~ x'

The log transformation is used to create allometric data from the non-linear relationship by re-scaling the two variables and making relationships linear.

Figure 6: 3D Scatterplot

library(scatterplot3d)
aa_3d <- scatterplot3d(x = aa_dat$H2Opho.34,
                       y = aa_dat$faal.fold,
                       z = aa_dat$polar.req,
                       xlab = "H2Opho.34",
                       ylab = "faal.fold",
                       zlab = "polar.req",
                       highlight.3d = T,
                       type = "h"
                       )

Polar requirement has a moderate negative correlation with both 1st hydrophobicity scale and fraction of area lost after folding. On the other hand, 1st hydrophobicity scale and fraction of area lost after folding shows positive correlation.

Principle component analysis

PCA is the process of reducing high dimensional data to lower dimensions (typically 2D) to examine part of the variance of multiple variables and makes such visualizations easier to interpret. As illustrated by Higgs and Attwood, the method transforms the cloud of data by “scaling them and shifting them to the origin, and then by rotating them in such a way that the points are spread out as much as possible” (Higgs and Attwood 2005, pg 26). The resulting graph shows different distance vectors for different variables, showing the covariance between each pair.

Two basic ways we made PCA plots are using base R and vegan. For both methods, we pass a matrix of NxP dimension where N is the number of objects in dataset and P is the number of properties we’re considering (Higgs and Attwood 2005), and the data set is scaled and shifted as described above.

Data Prep

Using the column index, we select only the first and third through tenth columns in original data set and make it a sub-data set.

aa_dat2 <- aa_dat[c(3:10)] # removing extra variables MW.da, polar.req and freq
row.names(aa_dat2) <- aa_dat$aa # assign row names

Figure 7: Principal components analysis - base R

pca.out <- prcomp(aa_dat2, scale. = T)
biplot(pca.out)

Figure 8: Principal components analysis - vegan

Setting scale to true in the vegan function enables scaling of the variables.

rda.out <- vegan::rda(aa_dat2, scale = T)
rda_scores <- scores(rda.out)
biplot(rda.out, display = "sites", col = 0)
orditorp(rda.out, display = "sites", labels = aa_dat$aa, cex = 1.2)
ordihull(rda.out,
         group = aa_dat$hydropathy,
         col = 1:7,
         lty = 1:7,
         lwd = c(3,6))

Cluster analysis

Cluster analysis is another way to look at relationships in large multivariable data sets. It is essentially the clustering of similar data points in order to determine the patterns in the data set, then building a dendrogram based off the data.

UPGMA is one such cluster analysis method (specifically a hierarchical cluster analysis) where we start by taking the original data to create a distance matrix. In this case, we account for the pair-wise quantitative difference (Euclidean distance) between the 20 amino acids. Then starting from the most similar amino acids, we iteratively combine the nearest two clusters into a larger cluster until all groups are included.

Hierarchical cluster analysis

A distance matrix is a matrix that shows pairwise difference for all the objects. In this case we’re looking at differences of the 20 amino acids across the 8 numeric data categories.

Euclidean distance is quite literally the distance between two points in an Euclidean space. Here we consider the geometric distance between the two clusters.

dist_euc <- dist(aa_dat2,
                 method = "euclidean")
clust_euc <- hclust(dist_euc,
                    method = "average") # setting method to average changes the agglomeration method to UPGMA
dendro_euc <- as.dendrogram(clust_euc)
plot(dendro_euc,horiz = T,
     nodePar = list(pch = c(1,NA), 
                                          cex = 0.5, 
                                          lab.cex = 0.5))

The dendrogram produces a different grouping pattern or different topology as compared to the cluster diagram produced by Higgs and Attwood. In example, A and S are very closely related in the dentrogram, but in the PCA cluster diagram they are in different clusters (green and blue). In addition, A, C, V, I, L, M, F, W are included in one cluster in the PCA diagram, while A, S, G, D are considered more similar to each other. In addition, whereas in PCA diagram we see a general mapping of amino acids with more similar properties, in the dendrogram we see even more complex relationships between them.