Preliminaries

In this R script, we look to replicate the work of Higgs and Atwood in analyzing the properties of the twenty different amino acids. These include numerical variables, such as the size and isoelectric points of the molecules, as well as categorical variables such as a given amino acid’s hydrophobicity. We then make extensive use of PCA and clustering algorithms to sort the amino acids which share such similar properties. Doing so will allow us to gain a better unerstanding of the mechanisms in which different amino acids function, a task which has wide implications for the treatment of diseases as well as drug discovery.

Packages

We begin by loading the necessary packages into R. Here, we note that the “permute” package helps generate data permutations which can be plotted, while the “lattice” package functions as a “high-level data visualization system”, which will aid us in the creation of some plots and graphs.

The “ggplot2” package is perhaps the preeminent data organizing/plotting package used in R - due to its complexity, we also run “ggpubr” as a wrapper, an accompanying package which simplifies its use.

library(ggplot2)
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.0.3
library(permute)
## Warning: package 'permute' was built under R version 4.0.3
library(lattice)
library(vegan)
## Warning: package 'vegan' was built under R version 4.0.3
## This is vegan 2.5-6
library(scatterplot3d)
## Warning: package 'scatterplot3d' was built under R version 4.0.3

Build the dataframe

We begin with the following vectors. Each vector listed describes a number of different classifications for each different attribute of the given amino acid sequence. We will use this information throughout the analysis.

# 1 letter code
aa <-c('A','C','D','E','F','G','H','I','K','L','M','N',
'P','Q','R','S','T','V','W','Y')

## molecular weight in dalton
MW.da <-c(89,121,133,146,165,75,155,131,146,131,149,132,115,147,174,105,119,117,204,181)

## vol from van der Waals radii
vol <-c(67,86,91,109,135,48,118,124,135,124,124,96,90,
114,148,73,93,105,163,141)

## bulk – a measure of the shape of the side chain
bulk <-c(11.5,13.46,11.68,13.57,19.8,3.4,13.69,21.4,
15.71,21.4,16.25,12.28,17.43,
14.45,14.28,9.47,15.77,21.57,21.67,18.03)

## pol – a measure of the electric field strength around the molecule
pol <-c(0,1.48,49.7,49.9,0.35,0,51.6,0.13,49.5,0.13,
1.43,3.38,1.58,3.53,52,1.67,1.66,0.13,2.1,1.61)

## isoelec point
isoelec <-c(6,5.07,2.77,3.22,5.48,5.97,7.59,6.02,9.74,5.98,
5.74,5.41,6.3,5.65,10.76,5.68,6.16,5.96,5.89,5.66)

## 1st Hydrophobicity scale
H2Opho.34 <-c(1.8,2.5,-3.5,-3.5,2.8,-0.4,-3.2,4.5,-3.9,3.8,1.9,
-3.5,-1.6,-3.5,-4.5,-0.8,-0.7,4.2,-0.9,-1.3)

## 2nd Hydrophobicity scale
H2Opho.35 <-c(1.6,2,-9.2,-8.2,3.7,1,-3,3.1,-8.8,2.8,3.4,-4.8,
-0.2,-4.1,-12.3,0.6,1.2,2.6,1.9,-0.7)

## Surface area accessible to water in an unfolded peptide
saaH2O <-c(113,140,151,183,218,85,194,182,211,180,204,158,
143,189,241,122,146,160,259,229)

## Fraction of accessible area lost when a protein folds
faal.fold <-c(0.74,0.91,0.62,0.62,0.88,0.72,0.78,0.88,0.52,
0.85,0.85,0.63,0.64,0.62,0.64,0.66,0.7,0.86,0.85,0.76)

# Polar requirement
polar.req <-c(7,4.8,13,12.5,5,7.9,8.4,4.9,10.1,4.9,5.3,10,
6.6,8.6,9.1,7.5,6.6,5.6,5.2,5.4)

## relative frequency of occurrence
## "The frequencies column shows the mean
### percentage of each amino acid in the protein sequences
### of modern organisms"
freq <-c(7.8,1.1,5.19,6.72,4.39,6.77,2.03,6.95,6.32,
10.15,2.28,4.37,4.26,3.45,5.23,6.46,5.12,7.01,1.09,3.3)

##
# charges
## un = Un-charged
## neg = negative
## pos = positive
charge<-c('un','un','neg','neg','un','un','pos','un','pos','un','un','un','un','un','pos','un','un','un','un','un')

## hydropathy
hydropathy<-c('hydrophobic','hydrophobic','hydrophilic','
hydrophilic','hydrophobic','neutral','neutral',
'hydrophobic','hydrophilic','hydrophobic','hydrophobic',
'hydrophilic','neutral',
'hydrophilic','hydrophilic','neutral','neutral',
'hydrophobic','hydrophobic','neutral')

## vol
vol.cat<-c('verysmall','small','small','medium',
'verylarge','verysmall','medium','large','large',
'large','large','small','small','medium','large',
'verysmall',
'small','medium','verylarge','verylarge')

## pol
pol.cat<-c('nonpolar','nonpolar','polar','polar',
'nonpolar','nonpolar','polar','nonpolar',
'polar','nonpolar','nonpolar','polar','nonpolar','polar',
'polar','polar','polar','nonpolar','nonpolar','polar')

## chemical
chemical<-c('aliphatic','sulfur','acidic','acidic','aromatic',
'aliphatic','basic','aliphatic','basic','aliphatic','sulfur',
'amide','aliphatic','amide','basic','hydroxyl','hydroxyl',
'aliphatic','aromatic','aromatic')

Build the dataframes

aa_dat <- data.frame(aa, MW.da, vol, bulk, pol, isoelec, H2Opho.34, H2Opho.35, saaH2O, faal.fold, polar.req, freq, charge, hydropathy, vol.cat, pol.cat, chemical)
aa_dat
##    aa MW.da vol  bulk   pol isoelec H2Opho.34 H2Opho.35 saaH2O faal.fold
## 1   A    89  67 11.50  0.00    6.00       1.8       1.6    113      0.74
## 2   C   121  86 13.46  1.48    5.07       2.5       2.0    140      0.91
## 3   D   133  91 11.68 49.70    2.77      -3.5      -9.2    151      0.62
## 4   E   146 109 13.57 49.90    3.22      -3.5      -8.2    183      0.62
## 5   F   165 135 19.80  0.35    5.48       2.8       3.7    218      0.88
## 6   G    75  48  3.40  0.00    5.97      -0.4       1.0     85      0.72
## 7   H   155 118 13.69 51.60    7.59      -3.2      -3.0    194      0.78
## 8   I   131 124 21.40  0.13    6.02       4.5       3.1    182      0.88
## 9   K   146 135 15.71 49.50    9.74      -3.9      -8.8    211      0.52
## 10  L   131 124 21.40  0.13    5.98       3.8       2.8    180      0.85
## 11  M   149 124 16.25  1.43    5.74       1.9       3.4    204      0.85
## 12  N   132  96 12.28  3.38    5.41      -3.5      -4.8    158      0.63
## 13  P   115  90 17.43  1.58    6.30      -1.6      -0.2    143      0.64
## 14  Q   147 114 14.45  3.53    5.65      -3.5      -4.1    189      0.62
## 15  R   174 148 14.28 52.00   10.76      -4.5     -12.3    241      0.64
## 16  S   105  73  9.47  1.67    5.68      -0.8       0.6    122      0.66
## 17  T   119  93 15.77  1.66    6.16      -0.7       1.2    146      0.70
## 18  V   117 105 21.57  0.13    5.96       4.2       2.6    160      0.86
## 19  W   204 163 21.67  2.10    5.89      -0.9       1.9    259      0.85
## 20  Y   181 141 18.03  1.61    5.66      -1.3      -0.7    229      0.76
##    polar.req  freq charge    hydropathy   vol.cat  pol.cat  chemical
## 1        7.0  7.80     un   hydrophobic verysmall nonpolar aliphatic
## 2        4.8  1.10     un   hydrophobic     small nonpolar    sulfur
## 3       13.0  5.19    neg   hydrophilic     small    polar    acidic
## 4       12.5  6.72    neg \nhydrophilic    medium    polar    acidic
## 5        5.0  4.39     un   hydrophobic verylarge nonpolar  aromatic
## 6        7.9  6.77     un       neutral verysmall nonpolar aliphatic
## 7        8.4  2.03    pos       neutral    medium    polar     basic
## 8        4.9  6.95     un   hydrophobic     large nonpolar aliphatic
## 9       10.1  6.32    pos   hydrophilic     large    polar     basic
## 10       4.9 10.15     un   hydrophobic     large nonpolar aliphatic
## 11       5.3  2.28     un   hydrophobic     large nonpolar    sulfur
## 12      10.0  4.37     un   hydrophilic     small    polar     amide
## 13       6.6  4.26     un       neutral     small nonpolar aliphatic
## 14       8.6  3.45     un   hydrophilic    medium    polar     amide
## 15       9.1  5.23    pos   hydrophilic     large    polar     basic
## 16       7.5  6.46     un       neutral verysmall    polar  hydroxyl
## 17       6.6  5.12     un       neutral     small    polar  hydroxyl
## 18       5.6  7.01     un   hydrophobic    medium nonpolar aliphatic
## 19       5.2  1.09     un   hydrophobic verylarge nonpolar  aromatic
## 20       5.4  3.30     un       neutral verylarge    polar  aromatic

Raw data exploration

Here, we will establish a correlation matrix.

cor_ <- round(cor(aa_dat[,-c(1,13:17)]),2)
diag(cor_) <- NA
cor_[upper.tri(cor_)] <- NA
cor_
##           MW.da   vol  bulk   pol isoelec H2Opho.34 H2Opho.35 saaH2O faal.fold
## MW.da        NA    NA    NA    NA      NA        NA        NA     NA        NA
## vol        0.93    NA    NA    NA      NA        NA        NA     NA        NA
## bulk       0.55  0.73    NA    NA      NA        NA        NA     NA        NA
## pol        0.29  0.24 -0.20    NA      NA        NA        NA     NA        NA
## isoelec    0.20  0.36  0.08  0.27      NA        NA        NA     NA        NA
## H2Opho.34 -0.27 -0.08  0.44 -0.67   -0.20        NA        NA     NA        NA
## H2Opho.35 -0.25 -0.16  0.32 -0.85   -0.26      0.85        NA     NA        NA
## saaH2O     0.97  0.99  0.64  0.29    0.35     -0.18     -0.23     NA        NA
## faal.fold  0.11  0.18  0.49 -0.53   -0.18      0.84      0.79   0.12        NA
## polar.req -0.05 -0.19 -0.53  0.76   -0.11     -0.79     -0.87  -0.11     -0.81
## freq      -0.52 -0.30 -0.04 -0.01    0.02      0.26     -0.02  -0.38     -0.18
##           polar.req freq
## MW.da            NA   NA
## vol              NA   NA
## bulk             NA   NA
## pol              NA   NA
## isoelec          NA   NA
## H2Opho.34        NA   NA
## H2Opho.35        NA   NA
## saaH2O           NA   NA
## faal.fold        NA   NA
## polar.req        NA   NA
## freq           0.14   NA
which(cor_ == max(cor_, na.rm = T), arr.ind = T)
##        row col
## saaH2O   8   2
which(cor_ == min(cor_, na.rm = T), arr.ind = T)
##           row col
## polar.req  10   7

We observe that the above code highlights the maximum positive correlation between two variables in the matrix (in this case, between “saaH2O” and “vol” with an r-value of 0.99). By changing the “max” to “min”, we will obtain the maximum negative correlation between two variables, found between “polar.req” and “H2Opho.35”, with an r-value of -0.87.

Plot 0: Scatterplot Matrix

To create our scatterplot matrix, which will more clearly display the strength of relationships between variables, we first run the code below. This will serve as a correlation coefficient between two data series.

panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor,...)
{
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(0, 1, 0, 1))
    r <- abs(cor(x, y))
    txt <- format(c(r, 0.123456789), digits = digits)[1]
    txt <- paste0(prefix, txt)
    if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
    text(0.5, 0.5, txt, cex = cex.cor * r)
}

Now, we may generate our scatterplot matrix.

plot( ~ MW.da + vol + pol + H2Opho.34 + polar.req + H2Opho.35 + saaH2O,
     data = na.omit(aa_dat[,c("MW.da","vol","pol","H2Opho.34","polar.req","H2Opho.35", "saaH2O")]),
     panel = panel.smooth,
     upper.panel =panel.cor )

We observe a high r-value of 0.99 between “vol” and “saaH2O”, these two variables clearly have the strongest correlation among those plotted. Conversely, “MW.da” and “polar.req” have the lowest degree of correlation - their r-value of -0.05 is very close to zero, indication very little to no correlation between the two. We also observe that some relationships, like that between “saaH2O” and “H2Opho.35”, appear to be nonlinear.

Plot 1: Replication Higgs and Atwood Figure 2.8

The following code generates a 2D graph displaying the relationship between the volume of an amino acid, and its isoelectric point, as first performed by Higgs and Atwood.

plot(vol  ~ isoelec, data = aa_dat,
     xlab = "pI",
     ylab = "Volume",
     main = "Volume vs pI in Protein Folding",
     col = 0) #this chunk of code is doing the main plotting
text(vol ~ isoelec, 
     labels = aa, 
     data = aa_dat, 
     col = 1:20) #this chunk of code adds the labels

## Figure 2: HA Figure 2.8 with ggpubr

ggscatter(y = "vol",
x = "isoelec",
size = "H2Opho.34",
color = "H2Opho.35",
data = aa_dat,
xlab = "pI",
ylab = "Volume")

## Figure 3: Highly Correlated Data Now, we will create a plot to analyze the relationship between two highly correlated variables: “SaaH2O” and “vol”.

ggscatter(y = "saaH2O",
          x = "vol",
          add = "reg.line",   # regression line added
          ellipse = TRUE,     # ellipse included
          cor.coef = TRUE,    # correlation coefficient included
          data = aa_dat,
          xlab = "Volume from van der Waals radii",
          ylab = "Surface area accessible to water in unfolded peptide")
## `geom_smooth()` using formula 'y ~ x'

Our data ellipse here works by encircling the underlying mean proportional to a given confidence interval, which itself is related to our calculated p-value. We also observe that our regression line appears to follow a linear relationship.

Figure 4: Non-linear relationship

Here, we graph two variables whose relationship is nonlinear - that is, there is no straight line correlation between the two.

ggscatter(y = "pol",
          x = "MW.da",
          color = "isoelec",
          size = "faal.fold",
          add  = "loess",
          data = aa_dat,
          xlab = "Polarity",
          ylab = "Molecular weight in daltons")
## `geom_smooth()` using formula 'y ~ x'

Figure 5: Non-linear relationship on LOG scale

First, we log-transform the two variables plotted on the x and y axes

aa_dat $ MW.da_log <- log(aa_dat[,"MW.da"])
aa_dat $ pol_log <- log(aa_dat[,"pol"])

Next, we create a log scale with these newly transformed variables

ggscatter(y = "pol_log",
          x = "MW.da_log",
          add = "reg.line",
          color = "isoelec",
          size = "faal.fold",
          xlab = "ln(Molecular weight in daltons)",
          ylab = "ln(Polarity)",
          data = aa_dat)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (stat_smooth).

The log transformation takes the natural logarithm of each value in the given vector, allowing the regression line to appear more linear.

Figure 6: 3D Scatterplot

We observe in the scatterplot matrix that the variables “vol”, “MW.da”, and “saaH2O” appear to have the strongest correlation, all of which appear to be linear with r-values > 0.97. Graphing the 3D plot, we have the following.

scatterplot3d(x = aa_dat$vol,
              y = aa_dat$MW.da,
              z = aa_dat$saaH2O,
              highlight.3d = TRUE,
              type = "h")

Principal Component Analysis

Principal component analysis works by gradually reducing the dimensions of an n-dimensional data matrix while using algorithms to group similar data points closer to one another.

Higgs and Atwood make extensive use of this technique in their work, writing “[PCA functions as] a way of combining the information from all eight properties into a two-dimensional graph” (Higgs and Atwood 2005, pg 24).

In our class, we made PCA plots using the “vegan” package, as well as the basic-R commands.

Data prep

Now we will create a version of our dataframe sans any categorical variables. To do so, I looked at all the variables in “aa_dat”, and included only the numerical ones in this new dataframe.

aa_dat2 = data.frame(vol, bulk, pol, isoelec, H2Opho.34, H2Opho.35, saaH2O, faal.fold)

Figure 7: Principal Components Analysis - base R

pca <- prcomp(aa_dat2, scale = TRUE)
biplot(pca)

Figure 8: Principal Components analysis - vegan

We begin by running the PCA using vegan, and calling our output “rda.out”.

rda.out <- vegan::rda(aa_dat2, scale = TRUE)

Next, we extract the PCA score.

rda_scores <- scores(rda.out)

From here, we create a PCA plot using the “biplot” function. We observe, that when compared to the PCA plot generated by Higgs and Atwood, that both the x and y-axis scales differ. However, the shape and location of the various groupings appear to resemble that of the original PCA plot.

biplot(rda.out, display = "sites", col = 0,)
orditorp(rda.out, display = "sites", labels = aa_dat$aa, cex = 1.2)
ordihull(rda.out,
group = aa_dat$charge,
        col = 1:7,
        lty = 1:7,
        lwd = c(3,6))

Cluster Analysis

Now we turn our focus to cluster analysis, which we have used in class to create phylogenetic trees from a given set of data. In this case, the clustering algorithm groups similar data points together.

One type of aforementioned clustering algorithms is UPGMA, known as a hierarchical agglomerative method, which means that data points begin in their own separate clusters before being joined to one another. The distances between all pairs of data points in two clusters are averaged, and the clusters with the shortest average distances are joined together. This process is repeated indefinitely until no cluster is unconnected.

row.names(aa_dat2) = paste(aa_dat$aa, sep = ".")

dist_euc <- dist(aa_dat2, 
                method = "euclidean")

clust_euc <- hclust(dist_euc, 
                   method = "average") #average refers to "average linkage method", another name for UPGMA

dendro_euc <- as.dendrogram(clust_euc) 

plot(dendro_euc,
     horiz = T,
     nodePar = list(pch = c(1,NA),
                    cex = 0.5,
                    lab.cex = 0.5))

My cluster diagram appears similar, at first glance, to the one created by Higgs and Atwood. The main differences appear to be the inclusion of a heat map to correlate differences between species, as well as minor discrepancies in the number of branches present (mine has two, theirs has three). This is likely due to different clustering algorithms being used.