Introduction

This script replicates Higg’s and Attwood’s analysis on amino acids. Here, we will explore amino acids and their functionsin protein folding through bioinformatics in R, specifically using scatterplots, Principal Component Analysis (PCA), and Hierarchical Cluster Analysis using the vegan package.

Preliminaries

Packages

#a package used for data visualization
library(ggplot2)

#allows for 'ggplot2' customization; a "wrapper" for 'ggplot2' package
library(ggpubr)

#a set of restricted permtation designes 
library(permute)

#an add on package for data visualization
library(lattice)

#a package commonly used by community ecologists for multivarate analysis
library(vegan)
## This is vegan 2.5-6
#used to make 3-D scatterplots
library(scatterplot3d) 

ggplot2 is a package used for data visualization, and ggpubr is a “wrapper” that allows for even more functions to the ggplot2 package

Building the Dataframe

These data are derived mostly from Higgs (2009) Table 1. They are similar to Higgs and Attwood 2005 but have extra columns. Nathan Brouwer added additional categorical varibles based on online information. The data here display characteristics of amino acids.

# 1 letter code
aa <-c('A','C','D','E','F','G','H','I','K','L','M','N',
'P','Q','R','S','T','V','W','Y')

## molecular weight in dalton
MW.da <-c(89,121,133,146,165,75,155,131,146,131,149,132,115,147,174,105,119,117,204,181)

## vol from van der Waals radii
vol <-c(67,86,91,109,135,48,118,124,135,124,124,96,90,
114,148,73,93,105,163,141)

## bulk – a measure of the shape of the side chain
bulk <-c(11.5,13.46,11.68,13.57,19.8,3.4,13.69,21.4,
15.71,21.4,16.25,12.28,17.43,
14.45,14.28,9.47,15.77,21.57,21.67,18.03)

## pol – a measure of the electric field strength around the molecule
pol <-c(0,1.48,49.7,49.9,0.35,0,51.6,0.13,49.5,0.13,
1.43,3.38,1.58,3.53,52,1.67,1.66,0.13,2.1,1.61)

## isoelec point
isoelec <-c(6,5.07,2.77,3.22,5.48,5.97,7.59,6.02,9.74,5.98,
5.74,5.41,6.3,5.65,10.76,5.68,6.16,5.96,5.89,5.66)

## 1st Hydrophobicity scale
H2Opho.34 <-c(1.8,2.5,-3.5,-3.5,2.8,-0.4,-3.2,4.5,-3.9,3.8,1.9,
-3.5,-1.6,-3.5,-4.5,-0.8,-0.7,4.2,-0.9,-1.3)

## 2nd Hydrophobicity scale
H2Opho.35 <-c(1.6,2,-9.2,-8.2,3.7,1,-3,3.1,-8.8,2.8,3.4,-4.8,
-0.2,-4.1,-12.3,0.6,1.2,2.6,1.9,-0.7)

## Surface area accessible to water in an unfolded peptide
saaH2O <-c(113,140,151,183,218,85,194,182,211,180,204,158,
143,189,241,122,146,160,259,229)

## Fraction of accessible area lost when a protein folds
faal.fold <-c(0.74,0.91,0.62,0.62,0.88,0.72,0.78,0.88,0.52,
0.85,0.85,0.63,0.64,0.62,0.64,0.66,0.7,0.86,0.85,0.76)

# Polar requirement
polar.req <-c(7,4.8,13,12.5,5,7.9,8.4,4.9,10.1,4.9,5.3,10,
6.6,8.6,9.1,7.5,6.6,5.6,5.2,5.4)

## relative frequency of occurance
## "The frequencies column shows the mean
### percentage of each amino acid in the protein sequences
### of modern organisms"
freq <-c(7.8,1.1,5.19,6.72,4.39,6.77,2.03,6.95,6.32,
10.15,2.28,4.37,4.26,3.45,5.23,6.46,5.12,7.01,1.09,3.3)


# charges
## un = Un-charged
## neg = negative
## pos = positive
charge<-c('un','un','neg','neg','un','un','pos','un','pos','un','un','un','un','un','pos','un','un','un','un','un')

## hydropathy
hydropathy<-c('hydrophobic','hydrophobic','hydrophilic','
hydrophilic','hydrophobic','neutral','neutral',
'hydrophobic','hydrophilic','hydrophobic','hydrophobic',
'hydrophilic','neutral',
'hydrophilic','hydrophilic','neutral','neutral',
'hydrophobic','hydrophobic','neutral')

## vol
vol.cat<-c('verysmall','small','small','medium',
'verylarge','verysmall','medium','large','large',
'large','large','small','small','medium','large',
'verysmall',
'small','medium','verylarge','verylarge')

## pol
pol.cat<-c('nonpolar','nonpolar','polar','polar',
'nonpolar','nonpolar','polar','nonpolar',
'polar','nonpolar','nonpolar','polar','nonpolar','polar',
'polar','polar','polar','nonpolar','nonpolar','polar')

## chemical
chemical<-c('aliphatic','sulfur','acidic','acidic','aromatic',
'aliphatic','basic','aliphatic','basic','aliphatic','sulfur',
'amide','aliphatic','amide','basic','hydroxyl','hydroxyl',
'aliphatic','aromatic','aromatic')
log_vol = log(vol)
log_polar_req = log(polar.req)
aa_dat <- data.frame(aa,MW.da,vol,bulk,
pol,isoelec,H2Opho.34,H2Opho.35,
saaH2O,faal.fold, polar.req,freq,charge,hydropathy,vol.cat,pol.cat,chemical, log_vol, log_polar_req)
aa_dat
##    aa MW.da vol  bulk   pol isoelec H2Opho.34 H2Opho.35 saaH2O faal.fold
## 1   A    89  67 11.50  0.00    6.00       1.8       1.6    113      0.74
## 2   C   121  86 13.46  1.48    5.07       2.5       2.0    140      0.91
## 3   D   133  91 11.68 49.70    2.77      -3.5      -9.2    151      0.62
## 4   E   146 109 13.57 49.90    3.22      -3.5      -8.2    183      0.62
## 5   F   165 135 19.80  0.35    5.48       2.8       3.7    218      0.88
## 6   G    75  48  3.40  0.00    5.97      -0.4       1.0     85      0.72
## 7   H   155 118 13.69 51.60    7.59      -3.2      -3.0    194      0.78
## 8   I   131 124 21.40  0.13    6.02       4.5       3.1    182      0.88
## 9   K   146 135 15.71 49.50    9.74      -3.9      -8.8    211      0.52
## 10  L   131 124 21.40  0.13    5.98       3.8       2.8    180      0.85
## 11  M   149 124 16.25  1.43    5.74       1.9       3.4    204      0.85
## 12  N   132  96 12.28  3.38    5.41      -3.5      -4.8    158      0.63
## 13  P   115  90 17.43  1.58    6.30      -1.6      -0.2    143      0.64
## 14  Q   147 114 14.45  3.53    5.65      -3.5      -4.1    189      0.62
## 15  R   174 148 14.28 52.00   10.76      -4.5     -12.3    241      0.64
## 16  S   105  73  9.47  1.67    5.68      -0.8       0.6    122      0.66
## 17  T   119  93 15.77  1.66    6.16      -0.7       1.2    146      0.70
## 18  V   117 105 21.57  0.13    5.96       4.2       2.6    160      0.86
## 19  W   204 163 21.67  2.10    5.89      -0.9       1.9    259      0.85
## 20  Y   181 141 18.03  1.61    5.66      -1.3      -0.7    229      0.76
##    polar.req  freq charge    hydropathy   vol.cat  pol.cat  chemical  log_vol
## 1        7.0  7.80     un   hydrophobic verysmall nonpolar aliphatic 4.204693
## 2        4.8  1.10     un   hydrophobic     small nonpolar    sulfur 4.454347
## 3       13.0  5.19    neg   hydrophilic     small    polar    acidic 4.510860
## 4       12.5  6.72    neg \nhydrophilic    medium    polar    acidic 4.691348
## 5        5.0  4.39     un   hydrophobic verylarge nonpolar  aromatic 4.905275
## 6        7.9  6.77     un       neutral verysmall nonpolar aliphatic 3.871201
## 7        8.4  2.03    pos       neutral    medium    polar     basic 4.770685
## 8        4.9  6.95     un   hydrophobic     large nonpolar aliphatic 4.820282
## 9       10.1  6.32    pos   hydrophilic     large    polar     basic 4.905275
## 10       4.9 10.15     un   hydrophobic     large nonpolar aliphatic 4.820282
## 11       5.3  2.28     un   hydrophobic     large nonpolar    sulfur 4.820282
## 12      10.0  4.37     un   hydrophilic     small    polar     amide 4.564348
## 13       6.6  4.26     un       neutral     small nonpolar aliphatic 4.499810
## 14       8.6  3.45     un   hydrophilic    medium    polar     amide 4.736198
## 15       9.1  5.23    pos   hydrophilic     large    polar     basic 4.997212
## 16       7.5  6.46     un       neutral verysmall    polar  hydroxyl 4.290459
## 17       6.6  5.12     un       neutral     small    polar  hydroxyl 4.532599
## 18       5.6  7.01     un   hydrophobic    medium nonpolar aliphatic 4.653960
## 19       5.2  1.09     un   hydrophobic verylarge nonpolar  aromatic 5.093750
## 20       5.4  3.30     un       neutral verylarge    polar  aromatic 4.948760
##    log_polar_req
## 1       1.945910
## 2       1.568616
## 3       2.564949
## 4       2.525729
## 5       1.609438
## 6       2.066863
## 7       2.128232
## 8       1.589235
## 9       2.312535
## 10      1.589235
## 11      1.667707
## 12      2.302585
## 13      1.887070
## 14      2.151762
## 15      2.208274
## 16      2.014903
## 17      1.887070
## 18      1.722767
## 19      1.648659
## 20      1.686399

Raw data exploration Correlation Matrix

cor_ <- round(cor(aa_dat[,-c(1,13:17)]),2)
diag(cor_) <- NA
cor_[upper.tri(cor_)] <- NA
cor_
##               MW.da   vol  bulk   pol isoelec H2Opho.34 H2Opho.35 saaH2O
## MW.da            NA    NA    NA    NA      NA        NA        NA     NA
## vol            0.93    NA    NA    NA      NA        NA        NA     NA
## bulk           0.55  0.73    NA    NA      NA        NA        NA     NA
## pol            0.29  0.24 -0.20    NA      NA        NA        NA     NA
## isoelec        0.20  0.36  0.08  0.27      NA        NA        NA     NA
## H2Opho.34     -0.27 -0.08  0.44 -0.67   -0.20        NA        NA     NA
## H2Opho.35     -0.25 -0.16  0.32 -0.85   -0.26      0.85        NA     NA
## saaH2O         0.97  0.99  0.64  0.29    0.35     -0.18     -0.23     NA
## faal.fold      0.11  0.18  0.49 -0.53   -0.18      0.84      0.79   0.12
## polar.req     -0.05 -0.19 -0.53  0.76   -0.11     -0.79     -0.87  -0.11
## freq          -0.52 -0.30 -0.04 -0.01    0.02      0.26     -0.02  -0.38
## log_vol        0.91  0.98  0.77  0.25    0.31     -0.07     -0.17   0.96
## log_polar_req -0.09 -0.22 -0.58  0.74   -0.03     -0.82     -0.86  -0.15
##               faal.fold polar.req  freq log_vol log_polar_req
## MW.da                NA        NA    NA      NA            NA
## vol                  NA        NA    NA      NA            NA
## bulk                 NA        NA    NA      NA            NA
## pol                  NA        NA    NA      NA            NA
## isoelec              NA        NA    NA      NA            NA
## H2Opho.34            NA        NA    NA      NA            NA
## H2Opho.35            NA        NA    NA      NA            NA
## saaH2O               NA        NA    NA      NA            NA
## faal.fold            NA        NA    NA      NA            NA
## polar.req         -0.81        NA    NA      NA            NA
## freq              -0.18      0.14    NA      NA            NA
## log_vol            0.17     -0.16 -0.29      NA            NA
## log_polar_req     -0.86      0.99  0.14    -0.2            NA

The code below uses the which() function to determine the maximum correlation value in the data frame and removes any NA values in the calculations. The most positive correlation is the saaH2O and volume, while the least is the polar.req and hydrophobe.35.

which(cor_ == max(cor_, na.rm = T), arr.ind = T)
##               row col
## saaH2O          8   2
## log_polar_req  13  10
which(cor_ == min(cor_, na.rm = T), arr.ind = T)
##           row col
## polar.req  10   7

Plot 0: Scatterplot matrix

pairs(aa_dat[2:8])

The function below creates lines of best fit in the scatterplot correlation matrix to show relatedness of the variables.

panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor,...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}

On the plots of the six variables plotted, MW.da vs. vol have a linear relationship. The other variables have no clear relationship. H2Opho.34 and polar.req have a linear relationship.

Plot 1: Replication Higgs and Attwood Figure 2.8

THe plot indicates the correlation between amino acid volume and polarity, which is coded for by the main plotting.

Correlation scatterplot matrix

plot(polar.req  ~ freq, data = aa_dat,
     xlab = "Polar requirement",
     ylab = "Volume (A^3)",
     main = "Amino Acid vol vs. pI",
     col = 0)
text(polar.req ~ freq, 
     labels = aa, 
     data = aa_dat, 
     col = 1:20)

Figure 2: HA Figure 2.8 with ggpubr

ggscatter(y = "polar.req",
x = "freq",
size = "polar.req",
color = "polar.req",
data = aa_dat,
xlab = "freq",
ylab = "polar.req")

Figure 3: Highly correlated data Plot 2 variables with positive correlation using ggscatter to plot volume and MW.da.

ggscatter(y = "vol",
          x = "MW.da",
          add = "reg.line", #Line of best fit
          ellipse = TRUE, #Data ellipse
          cor.coef = TRUE, #Correlation coefficient
          data = aa_dat,
          xlab = "MW.da",
          ylab = "Volume")
## `geom_smooth()` using formula 'y ~ x'

The line of best fit has an equation of N = B0 +B1T, and the ellipse represents the standard deviation.

Figure 4: Non-linear relationship Using a non-linear relationship to represent the data, we will use a loess smoother to smooth the line.

ggscatter(data = aa_dat,
          y = "vol",
          x = "polar.req",
          size = "MW.da",   
          color = "pol",
          add = "loess") +
  xlab("polar requirment")     +
  ylab("Volume")
## `geom_smooth()` using formula 'y ~ x'

Figure 5: Non-linear relationship on LOG scale Using a log scale, the pI and volume correlation can be examined to determine the correlation. The log scale makes the data more linear for the volume and pI data.

ggscatter(data = aa_dat,
          y = "log_vol",
          x = "log_polar_req",
          size = "bulk",   
          color = "freq",
          add = "reg.line") +
  xlab("log(pI)")     +
  ylab("log(Volume)")
## `geom_smooth()` using formula 'y ~ x'

Figure 6: 3D Scatterplot The data doesn’t indicate much correlation, as there aren’t cluster in the data.

scatterplot3d(x = aa_dat$polar.req,
                     y = aa_dat$MW.da,
                     z = aa_dat$svol,
              highlight.3d = TRUE,
              angle = 60,
              type="h")

Principal component analysis

PCA, or principal component analysis,is a commonly used analysis method in computational biology. According to the Higgs paper, a PCA is a way of “visualizing the important features of multidimensional data sets, such as tables of amino acid properties. PCA chooses coordinates that are linear combinations of the original variables in such a way that the maximum variability between the data points is explained by the first few coordinates” (Higgs and Attwood 2005, pg 25). The two basic ways we make PCA plots in R are through base R and through the vegan package. The rda() function in vegan should be used.

Data Prep The data must be prepped to remove categorical variables. A new data frame will be made of just numerical values. Only 8 of the 11 values will be used to align with the Higgs and Attwood table.

aa_dat2 = data.frame(vol, bulk, pol, isoelec,H2Opho.34, H2Opho.35, saaH2O, faal.fold)

Figure 7: Principal components analysis - base R PCA plot with base R command

pca.out <- prcomp(aa_dat2[,-c(7,8)], scale = TRUE)

biplot(pca.out)

Figure 8: Principal component analysis - vegan PCA using vegan; for another example see https://rpubs.com/brouwern/veganpca

rda.out <- vegan::rda(aa_dat2, scale = TRUE)

Extract the known PCA score

rda_scores <- scores(rda.out)

Making a “biplot,” sometimes referred to as a PCA plot

biplot(rda.out, display = "sites", col = 0,)
orditorp(rda.out, display = "sites", labels = aa_dat$aa, cex = 1.2)
ordihull(rda.out,
group = aa_dat$hydropathy,
col = 1:7,
lty = 1:7,
lwd = c(3,6))

This figure is different from the one in Figure 2.10, page 28, of Higgs and Attwood. One difference, for instance, is that this plot has groupings while the Higgs and Attwood plot does not.

Cluster Analysis Heirarchical Cluster Analysis

Cluster analysis is used to break down a large data set into useful subgroups. One type of cluster analysis is UPGMA, which is more specifically a hierarchical cluster analysis.

Here, are the row names for the UPGMA.

row.names(aa_dat2) <- paste(aa_dat$aa,1:nrow(aa_dat2),sep = ".")
row.names(aa_dat2)
##  [1] "A.1"  "C.2"  "D.3"  "E.4"  "F.5"  "G.6"  "H.7"  "I.8"  "K.9"  "L.10"
## [11] "M.11" "N.12" "P.13" "Q.14" "R.15" "S.16" "T.17" "V.18" "W.19" "Y.20"

A distance matrix can be made with the function dist() using the euclidean algorithm (method = “euclidian”). The distance matrix creates a two dimensional matrix, where the distances can be determined to represent differences or similarities between two points.

dist_euc <- dist(aa_dat2[,-9], 
                 method = "euclidean")

A cluster analysis can be done with hclust() function with the UPGMA algorithm.

clust_euc <- hclust(dist_euc, method = "average")
#setting the method to "average" makes the hclust() function perform a UPGMA cluster analysis. Hclust() defaults to a complete cluster analysis.

Plotting a dendrogram The cluster diagram compares to Higgs and Atwood’s in Plate 2.2 in that Higgs and Attwood included a heat map in their dendrogram and the clustering is different in their histogram compared to this one.

par(mar = c(1,1,1,1))
dendro_euc <- as.dendrogram(clust_euc)
plot(dendro_euc,horiz = T,nodePar = list(pch = c(1,NA), 
                                          cex = 0.5, 
                                          lab.cex = 0.5))