Replicating Higgs and Attwood’s analysis on the properties of amino acids

Introduction

This code is attempting to replicate the findings of Higgs and Attwood in their analysis of amino acids. A PCA will be run to create a dendrogram grouping the amino acids based on how evolutionarily similar to one another they are. Understanding amino acids is important in many bodily functions as proteins are responsible for many of the actions that keep the body functioning. Having knowledge of the relationships between different amino acids can allow scientists to determine solutions to different harmful mutations/problems that may come up and potentially to provide insight into other disorders that occur in the human body.

Preliminaries

Packages

ggpubr allows plotting with ggplot2 to become easier. It contains functions that allow the programmer to create and customize ggplot2-generated plots.

## Warning: package 'ggplot2' was built under R version 4.0.3

## Warning: package 'ggpubr' was built under R version 4.0.3

## Warning: package 'vegan' was built under R version 4.0.3

## Loading required package: permute

## Warning: package 'permute' was built under R version 4.0.3

## Loading required package: lattice

## This is vegan 2.5-6

## Warning: package 'scatterplot3d' was built under R version 4.0.3

Build the dataframe

These data are derived mostly from Higgs (2009) Table 1. They are similar to Higgs and Attwood 2005 but have extra columns. Nathan Brouwer added additional categorical variables based on online information.

This code creates vectors for each characteristic of the amino acid data.

# 1 letter code
aa        <-c('A','C','D','E','F','G','H','I','K','L','M','N',
              'P','Q','R','S','T','V','W','Y')


## molecular weight in dalton
MW.da     <-c(89,121,133,146,165,75,155,131,146,131,149,132,115,147,174,105,119,117,204,181)

## vol from van der Waals radii 
vol    <-c(67,86,91,109,135,48,118,124,135,124,124,96,90,
              114,148,73,93,105,163,141)

##  bulk – a measure of the shape of the side chain
bulk <-c(11.5,13.46,11.68,13.57,19.8,3.4,13.69,21.4,
              15.71,21.4,16.25,12.28,17.43,
              14.45,14.28,9.47,15.77,21.57,21.67,18.03)

## pol – a measure of the electric field strength around the molecule
pol  <-c(0,1.48,49.7,49.9,0.35,0,51.6,0.13,49.5,0.13,
              1.43,3.38,1.58,3.53,52,1.67,1.66,0.13,2.1,1.61)

## isoelec point
isoelec <-c(6,5.07,2.77,3.22,5.48,5.97,7.59,6.02,9.74,5.98,
                   5.74,5.41,6.3,5.65,10.76,5.68,6.16,5.96,5.89,5.66)

## 1st Hydrophobicity scale
H2Opho.34 <-c(1.8,2.5,-3.5,-3.5,2.8,-0.4,-3.2,4.5,-3.9,3.8,1.9,
                  -3.5,-1.6,-3.5,-4.5,-0.8,-0.7,4.2,-0.9,-1.3)

## 2nd Hydrophobicity scale
H2Opho.35 <-c(1.6,2,-9.2,-8.2,3.7,1,-3,3.1,-8.8,2.8,3.4,-4.8,
                  -0.2,-4.1,-12.3,0.6,1.2,2.6,1.9,-0.7)

## Surface area accessible to water in an unfolded peptide
saaH2O       <-c(113,140,151,183,218,85,194,182,211,180,204,158,
                 143,189,241,122,146,160,259,229)


## Fraction of accessible area lost when a protein folds
faal.fold    <-c(0.74,0.91,0.62,0.62,0.88,0.72,0.78,0.88,0.52,
                 0.85,0.85,0.63,0.64,0.62,0.64,0.66,0.7,0.86,0.85,0.76)



# Polar requirement 
polar.req    <-c(7,4.8,13,12.5,5,7.9,8.4,4.9,10.1,4.9,5.3,10,
                 6.6,8.6,9.1,7.5,6.6,5.6,5.2,5.4)

## relative frequency of occurance
## "The frequencies column shows the mean 
### percentage of each amino acid in the protein sequences 
### of modern organisms"
freq        <-c(7.8,1.1,5.19,6.72,4.39,6.77,2.03,6.95,6.32,
                10.15,2.28,4.37,4.26,3.45,5.23,6.46,5.12,7.01,1.09,3.3)

##
# charges
## un = Un-charged
## neg = negative
## pos = positive
charge<-c('un','un','neg','neg','un','un','pos','un','pos',
          'un','un','un','un','un','pos','un','un','un','un','un')

## hydropathy
hydropathy<-c('hydrophobic','hydrophobic','hydrophilic','
              hydrophilic','hydrophobic','neutral','neutral',
              'hydrophobic','hydrophilic','hydrophobic','hydrophobic',
              'hydrophilic','neutral',
              'hydrophilic','hydrophilic','neutral','neutral',
              'hydrophobic','hydrophobic','neutral')

## vol
vol.cat<-c('verysmall','small','small','medium',
              'verylarge','verysmall','medium','large','large',
              'large','large','small','small','medium','large',
              'verysmall',
              'small','medium','verylarge','verylarge')

## pol
pol.cat<-c('nonpolar','nonpolar','polar','polar',
                'nonpolar','nonpolar','polar','nonpolar',
                'polar','nonpolar','nonpolar','polar','nonpolar','polar',
                'polar','polar','polar','nonpolar','nonpolar','polar')

## chemical
chemical<-c('aliphatic','sulfur','acidic','acidic','aromatic',
            'aliphatic','basic','aliphatic','basic','aliphatic','sulfur',
            'amide','aliphatic','amide','basic','hydroxyl','hydroxyl',
            'aliphatic','aromatic','aromatic')

This dataframe contains all of the properties of amino acids in a consolidated form. It combines all of the vectors created above.

### Build dataframe
aa_dat <- data.frame(aa, 
                     MW.da,vol,
                     bulk, pol, 
                     isoelec,H2Opho.34, H2Opho.35,
                     saaH2O, faal.fold, polar.req,
                     freq, charge, hydropathy,
                     vol.cat, pol.cat, chemical)

Raw data exploration

Matrix formation

This code creates a matrix from the numerical data in the dataframe.

cor_ <- round(cor(aa_dat[,-c(1,13:17)]),2)
diag(cor_) <- NA
cor_[upper.tri(cor_)] <- NA
cor_

##           MW.da   vol  bulk   pol isoelec H2Opho.34 H2Opho.35 saaH2O faal.fold
## MW.da        NA    NA    NA    NA      NA        NA        NA     NA        NA
## vol        0.93    NA    NA    NA      NA        NA        NA     NA        NA
## bulk       0.55  0.73    NA    NA      NA        NA        NA     NA        NA
## pol        0.29  0.24 -0.20    NA      NA        NA        NA     NA        NA
## isoelec    0.20  0.36  0.08  0.27      NA        NA        NA     NA        NA
## H2Opho.34 -0.27 -0.08  0.44 -0.67   -0.20        NA        NA     NA        NA
## H2Opho.35 -0.25 -0.16  0.32 -0.85   -0.26      0.85        NA     NA        NA
## saaH2O     0.97  0.99  0.64  0.29    0.35     -0.18     -0.23     NA        NA
## faal.fold  0.11  0.18  0.49 -0.53   -0.18      0.84      0.79   0.12        NA
## polar.req -0.05 -0.19 -0.53  0.76   -0.11     -0.79     -0.87  -0.11     -0.81
## freq      -0.52 -0.30 -0.04 -0.01    0.02      0.26     -0.02  -0.38     -0.18
##           polar.req freq
## MW.da            NA   NA
## vol              NA   NA
## bulk             NA   NA
## pol              NA   NA
## isoelec          NA   NA
## H2Opho.34        NA   NA
## H2Opho.35        NA   NA
## saaH2O           NA   NA
## faal.fold        NA   NA
## polar.req        NA   NA
## freq           0.14   NA

This code chunk returns the array indices of the maximum value of the cor_ matrix. Surface area accessible to water and volume have the most positive correlation. Polar requirement and the second hydrophobicity scale have the most negative correlation.

which(cor_ == max(cor_, na.rm = T), arr.ind = T)

##        row col
## saaH2O   8   2

Plot 0: Scatterplot matrix

A function is used to create a panel for the scatterplot matrix that will compare all of the variables in cor_.

panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor,...)
{
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(0, 1, 0, 1))
    r <- abs(cor(x, y))
    txt <- format(c(r, 0.123456789), digits = digits)[1]
    txt <- paste0(prefix, txt)
    if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
    text(0.5, 0.5, txt, cex = cex.cor * r)
}

The code chunk below creates the scatterplot matrix. The strongest correlations are between MW.da & vol, MW.da & saaH2O, and saaH2O & vol. The weakest correlations are between MW.da & polar.req, vol & H2Opho.34, saaH2O & polar.req, and MW.da & faal.fold.

plot(aa_dat[c("MW.da", "vol", "pol", 
              "H2Opho.34", "polar.req",
              "saaH2O", "faal.fold")],
     panel = panel.smooth,
     upper.panel = panel.cor,
     main = "Example")

Plot 1: Replication Higgs and Attwood Figure 2.8

This is a plot of amino acid volume against pI, which are two important properties in protein folding.

Figure 2: HA Figure 2.8 with ggpubr

This is a different representation of the plot above. The point sizes are represented by the volume data and the color is represented by the bulk data.

ggscatter(y = "polar.req",
          x = "freq",
          size = "vol", 
          color = "bulk", 
          data = aa_dat,
          xlab = "pI",
          ylab = "Volume (Angstroms)")

Figure 3: Highly correlated data

The regression line is represented in the form y = B0 + B1x, where B1 is the slope of the regression line and B0 is the y intercept. The ellipse is showing the major axes of variation on a plot, both along the line of regression and perpendicular to it.

ggscatter(y = "saaH2O",
          x = "vol",
          size = "saaH2O",
          add = "reg.line", # line of best fit is added here
          ellipse = TRUE, # data ellipse is added here
          cor.coef = TRUE, # correlation coef is added here
          # color = "freq",
          data = aa_dat,
          xlab = "Volume",
          ylab = "Surface area accessible to water")

## `geom_smooth()` using formula 'y ~ x'

Figure 4: Non-linear relationship

This creates a scatter plot between the fraction of accessible area lost when a protein folds and the volume. This is not a linear relationship. Size is determined by surface area accessible to water in an unfolded peptide and color is represented by bulk.

ggscatter(y = "faal.fold",
          x = "vol",
          size = "saaH2O",
          color = "bulk",
          data = aa_dat,
          xlab = "Volume",
          ylab = "Fraction of accessible area lost when a protein folds") + 
  geom_smooth(method = "loess")

## `geom_smooth()` using formula 'y ~ x'

Figure 5: Non-linear relationship on LOG scale

Log transformation allows non-linear data to transform into more normal data when placed on a logarithmic scale.

aa_dat$faal.fold_log <- log10(aa_dat$faal.fold)
aa_dat$vol_log <- log10(aa_dat$vol)

ggscatter(y = "faal.fold_log",
          x = "vol_log",
          size = "saaH2O",
          color = "bulk",
          add = "reg.line",
          data = aa_dat,
          xlab = "Volume",
          ylab = "Fraction of accessible area lost when a protein folds")

## `geom_smooth()` using formula 'y ~ x'

Figure 6: 3D Scatterplot

The molecular weight and the volume have a highly positive correlation. The molecular weight and the surface area accessible to water also have a highly positive correlation. The volume and surface area accessible to water have the most highly positive correlation.

scatterplot3d(x = aa_dat$MW.da,
              y = aa_dat$vol,
              z = aa_dat$saaH2O,
              highlight.3d = TRUE,
              type = "h")

Principal component analysis

PCA is a plot that converts correlations between multiple into points on a 2D graph. The variables are placed into clusters of highly correlated data points. As stated in the paper, “what we need is a way of combining the information from all eight properties into a two-dimensional graph. This can be done with principal component analysis (PCA)”(Higgs and Attwood 2005, pg 26).

We can make PCA plots in multiple ways. First, by using the base R command prcomp() and creating a biplot of the output. Second, by using the vegan package to create a plot.

Data prep

To make this data frame, I first removed the categorical values that were present in aa_dat. Then, I compared the remaining numerical variables with Table 2.2 in Higgs and Attwood to remove any extra data.

aa_dat2 <- data.frame(vol, bulk, pol,  
                     isoelec, H2Opho.34, H2Opho.35,
                     saaH2O, faal.fold)

Figure 7: Principal component analysis - base R

Run a PCA and create a PCA plot using base R command.

pca.out <- prcomp(aa_dat2, scale = TRUE)
biplot(pca.out)

Figure 8: Principal component analysis - vegan

scale = TRUE will scale the data to the variance given, in this case correlations.

rda.out <- vegan::rda(aa_dat2,
                      scale = TRUE)

rda.scores <- scores(aa_dat2,
                     scale = TRUE)

This plot differs from Figure 2.10 in Higgs and Attwood because of the fact that it uses a different categorical value as the means of comparison. The plot I created has 3 distinct clusters, whereas Figure 2.10 has 4. This is because the charge variable has 3 different options: neutral, positive, and negative.

biplot(rda.out, display = "sites", col = 0,)
orditorp(rda.out, display = "sites", labels = aa_dat$aa, cex = 1.2)
ordihull(rda.out,
         group = aa_dat$charge,
         col = 1:7,
         lty = 1:7,
         lwd = c(3,6))

Cluster Analysis

Cluster analysis is used to put data points into groups to allow for classification and to demonstrate relationships between the variables. It can be done using various algorithms, each with different purposes and levels of specificity.

In order to run UPGMA, you must first make a distance matrix. This can be done using either Euclidean or Manhattan distance for numeric data. It is a hierarchical clustering method that uses the averages for distance. The distance between clusters is taken and updates as clusters are combined. The end product is a rooted dendrogram.

Hierarchal cluster analysis

Add row names.

row.names(aa_dat2) <- paste(aa_dat$aa,
                            #aa_dat$hydropathy,
                            sep = ".")

A distance matrix is a square matrix containing the pairwise distances between elements. It is the amount of differences between the elements. Euclidean distance is the distance between two points calculated using the Pythagorean theorem. It is also known as the Pythagorean distance.

dist_euc <- dist(aa_dat2,
                 method = "euclidean")

This cluster diagram differs from the diagram found in Plate 2.2 of Higgs and Attwood. The clades in my dendrogram are quite different, with most having at least one element that is not the same as Plate 2.2. This could be due to a different clustering algorithm being used or different variables being considered.

clust_euc <- hclust(dist_euc, 
                    method = "average",   # this makes it UPGMA
                    members = NULL)

plot(clust_euc, hang = -1, cex = 0.5)

par(mar = c(1,1,1,1))
is(clust_euc)

## [1] "hclust"

dendro_euc <- as.dendrogram(clust_euc)

is(dendro_euc)

## [1] "dendrogram"

plot(dendro_euc,horiz = T,nodePar = list(pch = c(1,NA), 
                                          cex = 0.5, 
                                          lab.cex = 0.5))