Replicating Higgs and Attwood’s analysis on the properties of amino acids

Introduction

An analysis of a set of amino acids is done by this script. The script uses R functions and packages to analyze the properties and similarities of amino acids. ## Preliminaries ### Packages 1. ggplot2 create elegant data visualisations using the grammar of graphics 2. ggpubr General Arguments Description 3. scatterplot3d: Plots a three dimensional point cloud. 4. vegan:the vegan package provides tools for descriptive community ecology.

## Loading required package: permute

## Loading required package: lattice

## This is vegan 2.5-6

GGpubr makes ggplot2 easier allowing the user to create more complex and compounded graphs.

library(permute)
library(lattice)

Build the dataframe

These data are derived mostly from Higgs (2009) Table 1. They are similar to Higgs and Attwood 2005 but have extra columns. Nathan Brouwer added additional categorical varibles based on online information.

Make the vectors.

# 1 letter code
aa  <-c('A','C','D','E','F','G','H','I','K','L','M','N','P','Q','R','S','T','V','W','Y')

# molecular weight in dalton
MW.da     <-c(89,121,133,146,165,75,155,131,146,131,149,132,115,147,174,105,119,117,204,181)

# vol from van der Waals radii
volume    <-c(67,86,91,109,135,48,118,124,135,124,124,96,90,
              114,148,73,93,105,163,141)

# bulk – a measure of the shape of the side chain
bulkiness <-c(11.5,13.46,11.68,13.57,19.8,3.4,13.69,21.4,
              15.71,21.4,16.25,12.28,17.43,
              14.45,14.28,9.47,15.77,21.57,21.67,18.03)

## pol – a measure of the electric field strength around the molecule
polarity  <-c(0,1.48,49.7,49.9,0.35,0,51.6,0.13,49.5,0.13,
              1.43,3.38,1.58,3.53,52,1.67,1.66,0.13,2.1,1.61)

# isoelec point
isoelectric.pt <-c(6,5.07,2.77,3.22,5.48,5.97,7.59,6.02,9.74,5.98,
                   5.74,5.41,6.3,5.65,10.76,5.68,6.16,5.96,5.89,5.66)

# 1st Hydrophobicity scale
hydrophobe.34 <-c(1.8,2.5,-3.5,-3.5,2.8,-0.4,-3.2,4.5,-3.9,3.8,1.9,
                  -3.5,-1.6,-3.5,-4.5,-0.8,-0.7,4.2,-0.9,-1.3)

# 2nd Hydrophobicity scale
hydrophobe.35 <-c(1.6,2,-9.2,-8.2,3.7,1,-3,3.1,-8.8,2.8,3.4,-4.8,
                  -0.2,-4.1,-12.3,0.6,1.2,2.6,1.9,-0.7)

# Surface area accessible to water in an unfolded peptide
saaH2O       <-c(113,140,151,183,218,85,194,182,211,180,204,158,
                 143,189,241,122,146,160,259,229)

# Fraction of accessible area lost when a protein folds
faal.fold    <-c(0.74,0.91,0.62,0.62,0.88,0.72,0.78,0.88,0.52,
                 0.85,0.85,0.63,0.64,0.62,0.64,0.66,0.7,0.86,0.85,0.76)

# Polar requirement
polar.req    <-c(7,4.8,13,12.5,5,7.9,8.4,4.9,10.1,4.9,5.3,10,
                 6.6,8.6,9.1,7.5,6.6,5.6,5.2,5.4)

## relative frequency of occurrance
## "The frequencies column shows the mean
### percentage of each amino acid in the protein sequences
### of modern organisms"
freq        <-c(7.8,1.1,5.19,6.72,4.39,6.77,2.03,6.95,6.32,
                10.15,2.28,4.37,4.26,3.45,5.23,6.46,5.12,7.01,1.09,3.3)


# charges
## un = Un-charged
## neg = negative
## pos = positive
charge<-c('un','un','neg','neg','un','un','pos','un','pos',
          'un','un','un','un','un','pos','un','un','un','un','un')

# hydropathy
hydropathy<-c('hydrophobic','hydrophobic','hydrophilic','
              hydrophilic','hydrophobic','neutral','neutral','hydrophobic','hydrophilic','hydrophobic','hydrophobic','hydrophilic','  neutral',
              'hydrophilic','hydrophilic','neutral','neutral',
              'hydrophobic','hydrophobic','neutral')

# vol
volume.cat<-c('verysmall','small','small','medium',
              'verylarge','verysmall','medium','large','large',
              'large','large','small','small','medium','large','verysmall','small','medium','verylarge','verylarge')

# pol
polarity.cat<-c('nonpolar','nonpolar','polar','polar',
                'nonpolar','nonpolar','polar','nonpolar',
                'polar','nonpolar','nonpolar','polar','nonpolar','polar',
                'polar','polar','polar','nonpolar','nonpolar','polar')

# chemical
chemical<-c('aliphatic','sulfur','acidic','acidic','aromatic',
            'aliphatic','basic','aliphatic','basic','aliphatic','sulfur',
            'amide','aliphatic','amide','basic','hydroxyl','hydroxyl',
            'aliphatic','aromatic','aromatic')

Build the dataframes.

### Build dataframe
aa_dat <- data.frame(aa,MW.da,volume,bulkiness,
polarity,isoelectric.pt,hydrophobe.34,hydrophobe.35,
saaH2O,faal.fold, polar.req,freq,charge,hydropathy,volume.cat,polarity.cat,chemical)
aa_dat

##    aa MW.da volume bulkiness polarity isoelectric.pt hydrophobe.34
## 1   A    89     67     11.50     0.00           6.00           1.8
## 2   C   121     86     13.46     1.48           5.07           2.5
## 3   D   133     91     11.68    49.70           2.77          -3.5
## 4   E   146    109     13.57    49.90           3.22          -3.5
## 5   F   165    135     19.80     0.35           5.48           2.8
## 6   G    75     48      3.40     0.00           5.97          -0.4
## 7   H   155    118     13.69    51.60           7.59          -3.2
## 8   I   131    124     21.40     0.13           6.02           4.5
## 9   K   146    135     15.71    49.50           9.74          -3.9
## 10  L   131    124     21.40     0.13           5.98           3.8
## 11  M   149    124     16.25     1.43           5.74           1.9
## 12  N   132     96     12.28     3.38           5.41          -3.5
## 13  P   115     90     17.43     1.58           6.30          -1.6
## 14  Q   147    114     14.45     3.53           5.65          -3.5
## 15  R   174    148     14.28    52.00          10.76          -4.5
## 16  S   105     73      9.47     1.67           5.68          -0.8
## 17  T   119     93     15.77     1.66           6.16          -0.7
## 18  V   117    105     21.57     0.13           5.96           4.2
## 19  W   204    163     21.67     2.10           5.89          -0.9
## 20  Y   181    141     18.03     1.61           5.66          -1.3
##    hydrophobe.35 saaH2O faal.fold polar.req  freq charge
## 1            1.6    113      0.74       7.0  7.80     un
## 2            2.0    140      0.91       4.8  1.10     un
## 3           -9.2    151      0.62      13.0  5.19    neg
## 4           -8.2    183      0.62      12.5  6.72    neg
## 5            3.7    218      0.88       5.0  4.39     un
## 6            1.0     85      0.72       7.9  6.77     un
## 7           -3.0    194      0.78       8.4  2.03    pos
## 8            3.1    182      0.88       4.9  6.95     un
## 9           -8.8    211      0.52      10.1  6.32    pos
## 10           2.8    180      0.85       4.9 10.15     un
## 11           3.4    204      0.85       5.3  2.28     un
## 12          -4.8    158      0.63      10.0  4.37     un
## 13          -0.2    143      0.64       6.6  4.26     un
## 14          -4.1    189      0.62       8.6  3.45     un
## 15         -12.3    241      0.64       9.1  5.23    pos
## 16           0.6    122      0.66       7.5  6.46     un
## 17           1.2    146      0.70       6.6  5.12     un
## 18           2.6    160      0.86       5.6  7.01     un
## 19           1.9    259      0.85       5.2  1.09     un
## 20          -0.7    229      0.76       5.4  3.30     un
##                     hydropathy volume.cat polarity.cat  chemical
## 1                  hydrophobic  verysmall     nonpolar aliphatic
## 2                  hydrophobic      small     nonpolar    sulfur
## 3                  hydrophilic      small        polar    acidic
## 4  \n              hydrophilic     medium        polar    acidic
## 5                  hydrophobic  verylarge     nonpolar  aromatic
## 6                      neutral  verysmall     nonpolar aliphatic
## 7                      neutral     medium        polar     basic
## 8                  hydrophobic      large     nonpolar aliphatic
## 9                  hydrophilic      large        polar     basic
## 10                 hydrophobic      large     nonpolar aliphatic
## 11                 hydrophobic      large     nonpolar    sulfur
## 12                 hydrophilic      small        polar     amide
## 13                     neutral      small     nonpolar aliphatic
## 14                 hydrophilic     medium        polar     amide
## 15                 hydrophilic      large        polar     basic
## 16                     neutral  verysmall        polar  hydroxyl
## 17                     neutral      small        polar  hydroxyl
## 18                 hydrophobic     medium     nonpolar aliphatic
## 19                 hydrophobic  verylarge     nonpolar  aromatic
## 20                     neutral  verylarge        polar  aromatic

Raw data exploration

correlation and matrices

The code creates a correlation matrix

cor_ <- round(cor(aa_dat[,-c(1,13:17)]),2)
diag(cor_) <- NA
cor_[upper.tri(cor_)] <- NA
cor_

##                MW.da volume bulkiness polarity isoelectric.pt hydrophobe.34
## MW.da             NA     NA        NA       NA             NA            NA
## volume          0.93     NA        NA       NA             NA            NA
## bulkiness       0.55   0.73        NA       NA             NA            NA
## polarity        0.29   0.24     -0.20       NA             NA            NA
## isoelectric.pt  0.20   0.36      0.08     0.27             NA            NA
## hydrophobe.34  -0.27  -0.08      0.44    -0.67          -0.20            NA
## hydrophobe.35  -0.25  -0.16      0.32    -0.85          -0.26          0.85
## saaH2O          0.97   0.99      0.64     0.29           0.35         -0.18
## faal.fold       0.11   0.18      0.49    -0.53          -0.18          0.84
## polar.req      -0.05  -0.19     -0.53     0.76          -0.11         -0.79
## freq           -0.52  -0.30     -0.04    -0.01           0.02          0.26
##                hydrophobe.35 saaH2O faal.fold polar.req freq
## MW.da                     NA     NA        NA        NA   NA
## volume                    NA     NA        NA        NA   NA
## bulkiness                 NA     NA        NA        NA   NA
## polarity                  NA     NA        NA        NA   NA
## isoelectric.pt            NA     NA        NA        NA   NA
## hydrophobe.34             NA     NA        NA        NA   NA
## hydrophobe.35             NA     NA        NA        NA   NA
## saaH2O                 -0.23     NA        NA        NA   NA
## faal.fold               0.79   0.12        NA        NA   NA
## polar.req              -0.87  -0.11     -0.81        NA   NA
## freq                   -0.02  -0.38     -0.18      0.14   NA

In order to find where the max correlation occurs, the code below uses functions.The most positive correlation is between SaaH2O and volume, whereas polar.req and hydrophobe.35 have the most negative correlation.

which(cor_ == max(cor_, na.rm = T), arr.ind = T)

##        row col
## saaH2O   8   2

Plot 0: Scatterplot matrix

panel.cor: Correlation coefficient panel for pairs function

panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor,...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}

The code below plots the scatter plot matrices.

plot(~MW.da+volume+polarity+hydrophobe.34+polar.req+bulkiness,
     data = na.omit(aa_dat[,c("MW.da","volume","polarity","hydrophobe.34","polar.req","bulkiness")]),
     panel = panel.smooth,
     upper.panel =panel.cor )

We are able to analyze and observe the relationships between the 6 variables from this information in the scatterplot matrix.There is a strong positive correlation if the correlation value is greater or equal to 0.7.Whereas, there is a strong negative correlation if the value is less than or equal to -0.7.

Plot 1: Replication Higgs and Attwood Figure 2.8

Now, we will replicate Figure 2.8 from the Higgs and Atwood paper showing the relationship between amino acid volume and pI.

plot(polar.req  ~ freq, data = aa_dat,
     xlab = "pI",
     ylab = "Volume (Å3)",
     main = "Amino Acid Vol vs. pI ",
     col = 0)
text(polar.req ~ freq, 
     labels = aa, 
     data = aa_dat, 
     col = 1:20)

The plot represents and shows the relationship between amino acid volume and pI. From this plot we can deduce the amino acid and its property based on where it is located within the figure.

Figure 2: HA Figure 2.8 with ggpubr

This graph compares the relationship and correlation between the polar requirement and frequency. As a result, the color and size of each point is at the hand of the molecular weight and volume.

ggscatter(y = "polar.req",
x = "freq",
size = "MW.da",
color = "volume",
data = aa_dat,
xlab = "Polar Requirement",
ylab = "Frequency")

Figure 3: Highly correlated data

Next, we will identify two variables that have a very strong positive correlation to one another and plot them in ggpubr.

ggscatter(y = "MW.da",
          x = "volume",
          add = "reg.line",   #adds regression line
          ellipse = TRUE,   #adds the ellipse
          cor.coef = TRUE,   #adds the correlation coefficient
          data = aa_dat,
          xlab = "Volume",
          ylab = "Molecular Weight")

## `geom_smooth()` using formula 'y ~ x'

## geom_smooth() using formula ‘y ~ x’ The general formula for a regression line is N = B0 + B1T. An ellipse represents all the points in a data set that are within a specified confidence interval.

Figure 4: Non-linear relationship

We will now identify two variables that have a non-linear relationship, which are volume and polar requirement.

ggscatter(data = aa_dat,
          y = "volume",
          x = "polar.req",
          size = "MW.da",   
          color = "polarity",
          add = "loess") +
  xlab("polar requirment")     +
  ylab("volume")

## `geom_smooth()` using formula 'y ~ x'

## geom_smooth() using formula ‘y ~ x’

Figure 5: Non-linear relationship on LOG scale

We will want to re-scale the data using the natural log to make it more linear. To do so, we will need to assign new values in the data frame that entail the log versions of volume and polar.req.

aa_dat$volume_log <- log(aa_dat[,"volume"])
aa_dat$polar.req_log <- log(aa_dat[,"polar.req"])

The code below will create the log-scale scatter plot:

aa_dat$volume_log <- log10(aa_dat$volume)
aa_dat$polar.req_log <- log10(aa_dat$polar.req)
ggscatter(y = "volume_log",
          x = "polar.req_log",
          add = "reg.line",
          size = "MW.da",
          color = "polarity",
          xlab = "Log10 polar requirement",
          ylab = "Log10 volume",
          data = aa_dat)

## `geom_smooth()` using formula 'y ~ x'

## geom_smooth() using formula ‘y ~ x’

Figure 6: 3D Scatterplot

The code below will create a 3D scatter plot that demonstrates the relationship / correlation between MW.da, volume, and saaH20. These three variables have relatively strong correlations with one another.

scatterplot3d(x = aa_dat$MW.da,
                     y = aa_dat$volume,
                     z = aa_dat$saaH2Og,
              xlab = "aa_dat$MW.da",
              highlight.3d = TRUE,
              type = "h")

Using the 3D scatter plot that was created above the relationship between MW.da, volume, and saaH20 can be clearly seen and defined. They demonstrate relatively strong relationships with each other on this type of plane and we can see that by looking at the lines that are created in relation to the other variables.

Principal Component Analysis

Principal component analysis, or PCA, is a statistical technique that is widely used to reduce the dimensionality of very large data sets. PCA is able to recalculate coordinates to match the new set of axes - essentially flattening the data down (reduction). It spreads the data out and can color code various variables and can take a 6D plot to a 2D one, with the greatest amount of variation.

Higgs and Atwood explain PCA as it has the ability to “shows that there is no particular similarity between amino acids in the same row” (Higgs and Attwood 2005, pg 4).

Data Prep

Now, we will create a dataset to drop any categorical variables and compare it to Table 2.2.

aa_dat2 <- aa_dat[,c("volume","bulkiness","polarity","isoelectric.pt","hydrophobe.34","hydrophobe.35","saaH2O","faal.fold")]

Figure 7: Principal components analysis - base R

We will now run a PCA and create a PCA plot using a base R command and

pca.out <- prcomp(aa_dat2, scale = TRUE)
biplot(pca.out)

### Figure 8: Principal components analysis - vegan We will run the PCA using vegan and call the output rda.out.

rda.out <- vegan::rda(aa_dat2, scale = TRUE)

Scale = TRUE is used to properly and correctly scale numeric data using their respective standard deviations.

Next, we will extract the PCA score and call it rda_scores.

rda_scores <- scores(rda.out)

Now, we will create a biplot using the code below.

biplot(rda.out, display = "sites", col = 0,)
orditorp(rda.out, display = "sites", labels = aa_dat$aa, cex = 1.2)
ordihull(rda.out,
group = aa_dat$charge,
col = 1:7,
lty = 1:7,
lwd = c(3,6))

The plot that is created above varies greatly from that in the Higgs and Atwood paper. For one,the axes and scaling is a bit different than the plot that is shown in the paper. In addition, the name / label of the x and y axes are different, as well. Lastly, the orientation of the figure shown is nearly opposite of the of Figure 2.10.

Cluster Analysis

Cluster analysis is a tool used in statistics that is composed of various algorithms that are ultimately used to classify different objects into similar groups, that assumes a constant rate of evolution. It’s done by comparing the similarity between two objects and if they are maximal they belong in the same group and vice versa. The overall purpose is to place objects into groups or “clusters”.

UPGMA is a type of clustering method that utilizes a bottom-up hierarchical clustering method. it stands for unweighted pair-group method with arithmetic mean. It is a distance method that requires a distance matrix, as a result. UPGMA is a clustering method that is used to generate dendrograms, among other types of visual plots.

Hierarchical Cluster Analysis

aa_dat2 = aa_dat[,-c(1,2,11,12,13,14,15,16,17)]

row.names(aa_dat2) = paste(aa_dat$aa,sep = ".")

dist_euc = dist(aa_dat2,method = "euclidean")

clust_euc = hclust(dist_euc,method = "average") #sets it to use UPGMA

dendro_euc = as.dendrogram(clust_euc)
plot(dendro_euc,horiz = T,
     nodePar = list(pch = c(1,NA),
                    cex = 0.5,
                    lab.cex = 1.2))

A matrix of distance is a table that is created that can show the distance between a pair of objects. It is a square matrix that takes into account distances, calculations in pairs and the property elements of a set. A matrix of Euclidean distances is a matrix of square distances between two points. It is slightly different from a normal distance matrix in the sense that as opposed to a regular distance matrix, the distances between two objects will be squared. This analysis of the cluster shows some variations between the one in the paper on Higgs and Atwood. Some of the amino acids, for example, are shifted farther / closer than those in the paper. As a consequence, this shows a distinction in the relative proximity of different amino acids.