“Replicating Higgs and Attwood’s analysis on the properties of amino acids”

Introduction

In this script we want to replicate the analysis on the properties of amino acids done by Higgs and Attwood.By building data frames from multiple variables in relation to amino acids, we can identify relationships and possible correlations. In this script we make different types of plots and diagrams to visualize the patterns among the amino acids and their properties. We care about this biologically because by understanding the relationships between the properties of amino acids, we can better understand and predict their actions and functions.

Preliminaries

Packages

ggpubr creates wrappers for ggplot2’s core functionality.

library(permute) ## Loading required package: permute
 
library(lattice) ## Loading required package: lattice

library(vegan) ## This is vegan 2.5-6

## This is vegan 2.5-6

Build the dataframe

These data are derived mostly from Higgs (2009) Table 1. They are similar to Higgs and Attwood 2005 but have extra columns. Nathan Brouwer added additional categorical varibles based on online information. The data below represents properties of 20 amino acids.

# 1 letter code
aa <-c('A','C','D','E','F','G','H','I','K','L','M','N', 'P','Q','R','S','T','V','W','Y')

## molecular weight in dalton
MW.da <-c(89,121,133,146,165,75,155,131,146,131,149,132,115,147,174,105,119,117,204,181)

 ## vol from van der Waals radii
vol <-c(67,86,91,109,135,48,118,124,135,124,124,96,90, 114,148,73,93,105,163,141)

## bulk – a measure of the shape of the side chain
bulk <-c(11.5,13.46,11.68,13.57,19.8,3.4,13.69,21.4, 15.71,21.4,16.25,12.28,17.43,
14.45,14.28,9.47,15.77,21.57,21.67,18.03)

## pol – a measure of the electric field strength around the molecule
pol <-c(0,1.48,49.7,49.9,0.35,0,51.6,0.13,49.5,0.13, 1.43,3.38,1.58,3.53,52,1.67,1.66,0.13,2.1,1.61)

## isoelec point
isoelec <-c(6,5.07,2.77,3.22,5.48,5.97,7.59,6.02,9.74,5.98, 5.74,5.41,6.3,5.65,10.76,5.68,6.16,5.96,5.89,5.66)

## 1st Hydrophobicity scale
H2Opho.34 <-c(1.8,2.5,-3.5,-3.5,2.8,-0.4,-3.2,4.5,-3.9,3.8,1.9, -3.5,-1.6,-3.5,-4.5,-0.8,-0.7,4.2,-0.9,-1.3)

## 2nd Hydrophobicity scale
H2Opho.35 <-c(1.6,2,-9.2,-8.2,3.7,1,-3,3.1,-8.8,2.8,3.4,-4.8, -0.2,-4.1,-12.3,0.6,1.2,2.6,1.9,-0.7)

## Surface area accessible to water in an unfolded peptide
saaH2O <-c(113,140,151,183,218,85,194,182,211,180,204,158, 143,189,241,122,146,160,259,229)

## Fraction of accessible area lost when a protein folds
faal.fold <-c(0.74,0.91,0.62,0.62,0.88,0.72,0.78,0.88,0.52, 0.85,0.85,0.63,0.64,0.62,0.64,0.66,0.7,0.86,0.85,0.76)

# Polar requirement
polar.req <-c(7,4.8,13,12.5,5,7.9,8.4,4.9,10.1,4.9,5.3,10, 6.6,8.6,9.1,7.5,6.6,5.6,5.2,5.4)

## relative frequency of occurance
## "The frequencies column shows the mean
### percentage of each amino acid in the protein sequences ### of modern organisms"
freq <-c(7.8,1.1,5.19,6.72,4.39,6.77,2.03,6.95,6.32,
10.15,2.28,4.37,4.26,3.45,5.23,6.46,5.12,7.01,1.09,3.3)

##
# charges
## un = Un-charged
## neg = negative
## pos = positive 
charge<-c('un','un','neg','neg','un','un','pos','un','pos','un','un','un','un','un','pos','un','un','un','un','un')

## hydropathy
hydropathy<-c('hydrophobic','hydrophobic','hydrophilic',' hydrophilic','hydrophobic','neutral','neutral',
              'hydrophobic','hydrophilic','hydrophobic','hydrophobic',
              'hydrophilic','neutral',
              'hydrophilic','hydrophilic','neutral','neutral',
              'hydrophobic','hydrophobic','neutral')

## vol
vol.cat<-c('verysmall','small','small','medium', 'verylarge','verysmall','medium','large','large',
              'large','large','small','small','medium','large',
              'verysmall',
              'small','medium','verylarge','verylarge')

## pol
pol.cat<-c('nonpolar','nonpolar','polar','polar', 'nonpolar','nonpolar','polar','nonpolar',
                'polar','nonpolar','nonpolar','polar','nonpolar','polar',
                'polar','polar','polar','nonpolar','nonpolar','polar')

## chemical
chemical<-c('aliphatic','sulfur','acidic','acidic','aromatic', 'aliphatic','basic','aliphatic','basic','aliphatic','sulfur', 'amide','aliphatic','amide','basic','hydroxyl','hydroxyl', 'aliphatic','aromatic','aromatic')

Build the Dataframes

aa_dat <- data.frame(aa,MW.da, vol, bulk, pol, isoelec, H2Opho.34, H2Opho.35, saaH2O, faal.fold, polar.req, freq, charge, hydropathy, vol.cat, pol.cat, chemical)
aa_dat

##    aa MW.da vol  bulk   pol isoelec H2Opho.34 H2Opho.35 saaH2O faal.fold
## 1   A    89  67 11.50  0.00    6.00       1.8       1.6    113      0.74
## 2   C   121  86 13.46  1.48    5.07       2.5       2.0    140      0.91
## 3   D   133  91 11.68 49.70    2.77      -3.5      -9.2    151      0.62
## 4   E   146 109 13.57 49.90    3.22      -3.5      -8.2    183      0.62
## 5   F   165 135 19.80  0.35    5.48       2.8       3.7    218      0.88
## 6   G    75  48  3.40  0.00    5.97      -0.4       1.0     85      0.72
## 7   H   155 118 13.69 51.60    7.59      -3.2      -3.0    194      0.78
## 8   I   131 124 21.40  0.13    6.02       4.5       3.1    182      0.88
## 9   K   146 135 15.71 49.50    9.74      -3.9      -8.8    211      0.52
## 10  L   131 124 21.40  0.13    5.98       3.8       2.8    180      0.85
## 11  M   149 124 16.25  1.43    5.74       1.9       3.4    204      0.85
## 12  N   132  96 12.28  3.38    5.41      -3.5      -4.8    158      0.63
## 13  P   115  90 17.43  1.58    6.30      -1.6      -0.2    143      0.64
## 14  Q   147 114 14.45  3.53    5.65      -3.5      -4.1    189      0.62
## 15  R   174 148 14.28 52.00   10.76      -4.5     -12.3    241      0.64
## 16  S   105  73  9.47  1.67    5.68      -0.8       0.6    122      0.66
## 17  T   119  93 15.77  1.66    6.16      -0.7       1.2    146      0.70
## 18  V   117 105 21.57  0.13    5.96       4.2       2.6    160      0.86
## 19  W   204 163 21.67  2.10    5.89      -0.9       1.9    259      0.85
## 20  Y   181 141 18.03  1.61    5.66      -1.3      -0.7    229      0.76
##    polar.req  freq charge   hydropathy   vol.cat  pol.cat  chemical
## 1        7.0  7.80     un  hydrophobic verysmall nonpolar aliphatic
## 2        4.8  1.10     un  hydrophobic     small nonpolar    sulfur
## 3       13.0  5.19    neg  hydrophilic     small    polar    acidic
## 4       12.5  6.72    neg  hydrophilic    medium    polar    acidic
## 5        5.0  4.39     un  hydrophobic verylarge nonpolar  aromatic
## 6        7.9  6.77     un      neutral verysmall nonpolar aliphatic
## 7        8.4  2.03    pos      neutral    medium    polar     basic
## 8        4.9  6.95     un  hydrophobic     large nonpolar aliphatic
## 9       10.1  6.32    pos  hydrophilic     large    polar     basic
## 10       4.9 10.15     un  hydrophobic     large nonpolar aliphatic
## 11       5.3  2.28     un  hydrophobic     large nonpolar    sulfur
## 12      10.0  4.37     un  hydrophilic     small    polar     amide
## 13       6.6  4.26     un      neutral     small nonpolar aliphatic
## 14       8.6  3.45     un  hydrophilic    medium    polar     amide
## 15       9.1  5.23    pos  hydrophilic     large    polar     basic
## 16       7.5  6.46     un      neutral verysmall    polar  hydroxyl
## 17       6.6  5.12     un      neutral     small    polar  hydroxyl
## 18       5.6  7.01     un  hydrophobic    medium nonpolar aliphatic
## 19       5.2  1.09     un  hydrophobic verylarge nonpolar  aromatic
## 20       5.4  3.30     un      neutral verylarge    polar  aromatic

Raw data exploration

Correlation Matrix

This code makes a correlation matrix.

cor_ <- round(cor(aa_dat[,-c(1,13:17)]),2) 
diag(cor_) <- NA
cor_[upper.tri(cor_)] <- NA
cor_

##           MW.da   vol  bulk   pol isoelec H2Opho.34 H2Opho.35 saaH2O faal.fold
## MW.da        NA    NA    NA    NA      NA        NA        NA     NA        NA
## vol        0.93    NA    NA    NA      NA        NA        NA     NA        NA
## bulk       0.55  0.73    NA    NA      NA        NA        NA     NA        NA
## pol        0.29  0.24 -0.20    NA      NA        NA        NA     NA        NA
## isoelec    0.20  0.36  0.08  0.27      NA        NA        NA     NA        NA
## H2Opho.34 -0.27 -0.08  0.44 -0.67   -0.20        NA        NA     NA        NA
## H2Opho.35 -0.25 -0.16  0.32 -0.85   -0.26      0.85        NA     NA        NA
## saaH2O     0.97  0.99  0.64  0.29    0.35     -0.18     -0.23     NA        NA
## faal.fold  0.11  0.18  0.49 -0.53   -0.18      0.84      0.79   0.12        NA
## polar.req -0.05 -0.19 -0.53  0.76   -0.11     -0.79     -0.87  -0.11     -0.81
## freq      -0.52 -0.30 -0.04 -0.01    0.02      0.26     -0.02  -0.38     -0.18
##           polar.req freq
## MW.da            NA   NA
## vol              NA   NA
## bulk             NA   NA
## pol              NA   NA
## isoelec          NA   NA
## H2Opho.34        NA   NA
## H2Opho.35        NA   NA
## saaH2O           NA   NA
## faal.fold        NA   NA
## polar.req        NA   NA
## freq           0.14   NA

Locating the points of maximum correlation.

which(cor_ == max(cor_, na.rm = T), arr.ind = T)

##        row col
## saaH2O   8   2

cor_[which(cor_ == max(cor_, na.rm = T), arr.ind = T)]

## [1] 0.99

The maximum correlation (strongest positive) is between saaH2O (row 8) and volume (column 2) at 0.99.

Locating the points of minimum correlation.

which(cor_ == min(cor_, na.rm = T), arr.ind = T)

##           row col
## polar.req  10   7

cor_[which(cor_ == min(cor_, na.rm = T), arr.ind = T)]

## [1] -0.87

The minimum correlation (strongest negative) is between polar requirement (row 10) and the 2nd hyrophobicity scale (column 7) at -0.87.

Plot 0: Scatterplot matrix

The code below formats the scatterplot matrix by taking variables from the data frame.

panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor,...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1] 
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r) 
}

This code plots the scatterplot matrix.

aa_matrix <- data.frame(MW.da, vol, pol, H2Opho.34, polar.req, saaH2O, faal.fold)

plot(aa_matrix,upper.panel = panel.cor,
     panel = panel.smooth,
     main = "Plot 0: Scatterplot matrix")

We can observe the relationship between 6 variables in this scatterplot matrix. It shows correlation values and if these values are 0.7 or above there is a strong positive correlation between the variables and a strong negative correlation if the value is -0.7 or lower.

Plot 1: Replication Higgs and Attwood Figure 2.8

Replicating the plot from Figure 2.8 of Higgs and Attwood showing the relationship between volume and pI in amino acids.

#main plotting
plot(vol ~ isoelec,
     data = aa_dat,
     xlab = "pI",
     ylab = "Volume(Å3)",
     main = "Plot 1: Amino acid vol vs. pI",
     col = 0)

#adding text labels
text(vol ~ isoelec,
     labels = aa,
     data = aa_dat,
     col = 1:20)

The plot assists in visualizing the relationship between volume and isoeletric point in 20 amino acids.

Figure 2: HA Figure 2.8 with ggpubr

Making a plot that compares reelationship between polar requirement and frequency

ggpubr::ggscatter(y = "polar.req",
        x = "freq",
        size = "MW.da",
        color = "polar.req", 
        data = aa_dat,
        xlab = "polar requirement",
        ylab = "frequency")

Figure 3: Highly correlated data

ggpubr::ggscatter(y = "MW.da",
                  x = "vol",
                  add = "reg.line", # regression line 
                  ellipse = TRUE, # data ellipse
                  cor.coef = TRUE, # correlation coef
                  data = aa_dat, 
                  xlab = "Volume",
                  ylab = "Molecular weight")

## `geom_smooth()` using formula 'y ~ x'

Since the regression line shows a linear upward trend, the greater the 1st hyrophobicity, the larger the fraction of accessible are lost when a protein folds.The general formula id Y=B0+B1X. The data ellipse shows a bivariate mean and the points in a data set in a confidence interval.

Figure 4: Non-linear relationship

ggpubr::ggscatter(x = "saaH2O",
                  y = "polar.req",
                  data = aa_dat,
                  add ="loess",
                  size = "pol",
                  color = "faal.fold",
                  xlab = "Surface area accessible to water in an unfolded peptide",
                  ylab = "Polar Requirement")

## `geom_smooth()` using formula 'y ~ x'

Figure 5: Non-linear relationship on LOG scale

aa_dat$saaH2O_log <- log(aa_dat$saaH2O)
aa_dat$polar.req_log <- log(aa_dat$polar.req)
ggpubr::ggscatter(x = "saaH2O_log",
                  y = "polar.req_log",
                  data = aa_dat,
                  add = "reg.line",
                  size = "pol",
                  color = "faal.fold",
                  xlab = "Log Surface area accessible to water in an unfolded peptide",
                  ylab = "Log polar requirement")

## `geom_smooth()` using formula 'y ~ x'

The log transformation rescales the two variables from a non-linear relationship in order to make them linear.

Figure 6: 3D Scatterplot

library(scatterplot3d)
aa_3d <- scatterplot3d(x = aa_dat$H2Opho.34,
                     y = aa_dat$faal.fold,
                     z = aa_dat$polar.req,
              xlab = "H2Opho.34",
              ylab = "faal.fold",
              zlab = "polar.req",
              highlight.3d = TRUE,
              type = "h")

The correlation between the fraction of accessible area lost when a protein folds and the 1st hydrophobicity is positive. The correlation between polar requirement and the fraction of accessible area lost when a protein folds is negative. The correlation between polar requirement and 1st hydrophobicity is also negative.

Principal component analysis

PCA (principal component analysis) is a statistical technique that is used to reduce the dimensional data of data sets. This allows for easier examination and interpretation of the variances of many variables. Higgs and Atwood describe PCA as having the ability that “shows that there is no particular similarity between amino acids in the same row” (Higgs and Attwood 2005, pg 4). In this class, we used two basic ways we make PCA plots. The two ways are through vegan and base R. The number of objects in a data set and the number of properties evaluated are used in creating the plots.

Data prep

A new data set was created based off Table 2.2. The categorical variables were dropped.

aa_dat2 <- aa_dat[c("vol","bulk","pol","isoelec","H2Opho.34","H2Opho.35","saaH2O","faal.fold")]
row.names(aa_dat2) <- aa_dat$aa

Figure 7: Principal components analysis - base R

pca.out <- prcomp(aa_dat2, scale = TRUE)
biplot(pca.out)

Figure 8: Principal components analysis - vegan

scale = TRUE allows for variable scaling using standard devation.

rda.out <- vegan::rda(aa_dat2, scale = TRUE)
rda_scores <- scores(rda.out)
biplot(rda.out, display ="sites", col = 0)
orditorp(rda.out, display = "sites", labels = aa_dat$aa, cex = 1.2)
ordihull(rda.out,
         group = aa_dat$hydropathy,
         col = 1:7,
         lty = 1:7,
         lwd = c(3,6))

The plot above and the one from Figure 2.10 are a bit different. The variables are different with different numerical ranges. The orientation of the plots seem completely opposite, almost like it flipped.

Cluster analysis

The purpose of cluster analysis is to place objects into groups, or clusters, suggested by the data. This allows for the identification of patterns in the data set. UPGMA is a hierarchical cluster analysis method. A distance matrix is first made from data. The the Euclidean distance are also taken into consideration among the amino acids. Clusters are made by combining the most similar amino acids and then combining the nearest clusters to continue making bigger clusters.

Hieratchical cluster analysis

A distance matrix shows the difference between pairs of objects. For this analysis we are looking at differences between amino acids in different data categories. The Euclidean distance is the distance between two points in a Euclidean space. The numerical value of the points is the Euclidean distance.

dist_euc <- dist(aa_dat2,
                 method = "euclidean")
clust_euc <- hclust(dist_euc,
                    method = "average") # setting to use UPGMA
dendro_euc <- as.dendrogram(clust_euc)
plot(dendro_euc, horiz = TRUE,
     nodePar = list(pch = c(1,NA),
                    cex = 0.5,
                    lab.cex = 0.5))

There are a few difference between the cluster diagram above and the one produced by Higgs and Atwood. Some of the amino acids clustered together in the Higgs and Atwood diagram are not closely related in the dendrogram. Most noticeably, A and S are not on the same cluster in Higgs and Atwood but are closely related in the above dendrogram.