Replicating Higgs and Attwood’s analysis on the properties of amino acids

Introduction

This particular script is focused on replicating Higgs and Attwood’s analysis on the properties of 20 amino acids. Clustering is used to group different amino acids based on their similarities and differences. Some of the clustering methods used include PCA and UPGMA. This script helps us visualize the similarities and differences between amino acids, allowing scientists to learn more about shared properties, which can later be applied to futher studies of amino acids in several areas such as computational drug discovery.

Preliminaries

Packages

## Warning: package 'ggplot2' was built under R version 4.0.3

## Warning: package 'ggpubr' was built under R version 4.0.3

## Warning: package 'vegan' was built under R version 4.0.3

## Loading required package: permute

## Warning: package 'permute' was built under R version 4.0.3

## Loading required package: lattice

## This is vegan 2.5-6

## Warning: package 'scatterplot3d' was built under R version 4.0.3

Build the dataframe

These data are derived mostly from Higgs (2009) Table 1. They are similar to Higgs and Attwood 2005 but have extra columns. Nathan Brouwer added additional categorical varibles based on online information.

Make the vectors.

# 1 letter code
aa        <-c('A','C','D','E','F','G','H','I','K','L','M','N',
              'P','Q','R','S','T','V','W','Y')


## molecular weight in dalton
MW.da     <-c(89,121,133,146,165,75,155,131,146,131,149,132,115,147,174,105,119,117,204,181)

## vol from van der Waals radii 
vol    <-c(67,86,91,109,135,48,118,124,135,124,124,96,90,
              114,148,73,93,105,163,141)

##  bulk – a measure of the shape of the side chain
bulk <-c(11.5,13.46,11.68,13.57,19.8,3.4,13.69,21.4,
              15.71,21.4,16.25,12.28,17.43,
              14.45,14.28,9.47,15.77,21.57,21.67,18.03)

## pol – a measure of the electric field strength around the molecule
pol  <-c(0,1.48,49.7,49.9,0.35,0,51.6,0.13,49.5,0.13,
              1.43,3.38,1.58,3.53,52,1.67,1.66,0.13,2.1,1.61)

## isoelec point
isoelec <-c(6,5.07,2.77,3.22,5.48,5.97,7.59,6.02,9.74,5.98,
                   5.74,5.41,6.3,5.65,10.76,5.68,6.16,5.96,5.89,5.66)

## 1st Hydrophobicity scale
H2Opho.34 <-c(1.8,2.5,-3.5,-3.5,2.8,-0.4,-3.2,4.5,-3.9,3.8,1.9,
                  -3.5,-1.6,-3.5,-4.5,-0.8,-0.7,4.2,-0.9,-1.3)

## 2nd Hydrophobicity scale
H2Opho.35 <-c(1.6,2,-9.2,-8.2,3.7,1,-3,3.1,-8.8,2.8,3.4,-4.8,
                  -0.2,-4.1,-12.3,0.6,1.2,2.6,1.9,-0.7)

## Surface area accessible to water in an unfolded peptide
saaH2O       <-c(113,140,151,183,218,85,194,182,211,180,204,158,
                 143,189,241,122,146,160,259,229)


## Fraction of accessible area lost when a protein folds
faal.fold    <-c(0.74,0.91,0.62,0.62,0.88,0.72,0.78,0.88,0.52,
                 0.85,0.85,0.63,0.64,0.62,0.64,0.66,0.7,0.86,0.85,0.76)



# Polar requirement 
polar.req    <-c(7,4.8,13,12.5,5,7.9,8.4,4.9,10.1,4.9,5.3,10,
                 6.6,8.6,9.1,7.5,6.6,5.6,5.2,5.4)

## relative frequency of occurance
## "The frequencies column shows the mean 
### percentage of each amino acid in the protein sequences 
### of modern organisms"
freq        <-c(7.8,1.1,5.19,6.72,4.39,6.77,2.03,6.95,6.32,
                10.15,2.28,4.37,4.26,3.45,5.23,6.46,5.12,7.01,1.09,3.3)

##
# charges
## un = Un-charged
## neg = negative
## pos = positive
charge<-c('un','un','neg','neg','un','un','pos','un','pos',
          'un','un','un','un','un','pos','un','un','un','un','un')

## hydropathy
hydropathy<-c('hydrophobic','hydrophobic','hydrophilic','
              hydrophilic','hydrophobic','neutral','neutral',
              'hydrophobic','hydrophilic','hydrophobic','hydrophobic',
              'hydrophilic','neutral',
              'hydrophilic','hydrophilic','neutral','neutral',
              'hydrophobic','hydrophobic','neutral')

## vol
vol.cat<-c('verysmall','small','small','medium',
              'verylarge','verysmall','medium','large','large',
              'large','large','small','small','medium','large',
              'verysmall',
              'small','medium','verylarge','verylarge')

## pol
pol.cat<-c('nonpolar','nonpolar','polar','polar',
                'nonpolar','nonpolar','polar','nonpolar',
                'polar','nonpolar','nonpolar','polar','nonpolar','polar',
                'polar','polar','polar','nonpolar','nonpolar','polar')

## chemical
chemical<-c('aliphatic','sulfur','acidic','acidic','aromatic',
            'aliphatic','basic','aliphatic','basic','aliphatic','sulfur',
            'amide','aliphatic','amide','basic','hydroxyl','hydroxyl',
            'aliphatic','aromatic','aromatic')

Build the dataframes. This code generates a data frame withe all the information that can be analyzed later using other functions.

### Build dataframe
aa_dat <- data.frame(aa, 
                     MW.da,vol,
                     bulk, pol, 
                     isoelec,H2Opho.34, H2Opho.35,
                     saaH2O, faal.fold, polar.req,
                     freq, charge, hydropathy,
                     vol.cat, pol.cat, chemical)

Raw data exploration

Correlation Matrix

This code generates a correlation matrix.

cor_ <- round(cor(aa_dat[,-c(1,13:17)]),2)
diag(cor_) <- NA
cor_[upper.tri(cor_)] <- NA
cor_

##           MW.da   vol  bulk   pol isoelec H2Opho.34 H2Opho.35 saaH2O faal.fold
## MW.da        NA    NA    NA    NA      NA        NA        NA     NA        NA
## vol        0.93    NA    NA    NA      NA        NA        NA     NA        NA
## bulk       0.55  0.73    NA    NA      NA        NA        NA     NA        NA
## pol        0.29  0.24 -0.20    NA      NA        NA        NA     NA        NA
## isoelec    0.20  0.36  0.08  0.27      NA        NA        NA     NA        NA
## H2Opho.34 -0.27 -0.08  0.44 -0.67   -0.20        NA        NA     NA        NA
## H2Opho.35 -0.25 -0.16  0.32 -0.85   -0.26      0.85        NA     NA        NA
## saaH2O     0.97  0.99  0.64  0.29    0.35     -0.18     -0.23     NA        NA
## faal.fold  0.11  0.18  0.49 -0.53   -0.18      0.84      0.79   0.12        NA
## polar.req -0.05 -0.19 -0.53  0.76   -0.11     -0.79     -0.87  -0.11     -0.81
## freq      -0.52 -0.30 -0.04 -0.01    0.02      0.26     -0.02  -0.38     -0.18
##           polar.req freq
## MW.da            NA   NA
## vol              NA   NA
## bulk             NA   NA
## pol              NA   NA
## isoelec          NA   NA
## H2Opho.34        NA   NA
## H2Opho.35        NA   NA
## saaH2O           NA   NA
## faal.fold        NA   NA
## polar.req        NA   NA
## freq           0.14   NA

This code outputs the column/row with the most positive and negative correlation. SaaH2O and vol have the most positive correlation of 0.99. Polar.req and H2Opho.35 have the most negative correlation of -0.87.

which(cor_ == max(cor_, na.rm = T), arr.ind = T)

##        row col
## saaH2O   8   2

which(cor_ == min(cor_, na.rm = T), arr.ind = T)

##           row col
## polar.req  10   7

Plot 0: Scatterplot matrix

This function allows for an analysis of correlation between variables x and y (positive, negative, or no correlation).

panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor,...)
{
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(0, 1, 0, 1))
    r <- abs(cor(x, y))
    txt <- format(c(r, 0.123456789), digits = digits)[1]
    txt <- paste0(prefix, txt)
    if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
    text(0.5, 0.5, txt, cex = cex.cor * r)
}

This code generates a scatterplot matrix.

plot( ~ MW.da + vol + pol + H2Opho.34 + polar.req + saaH2O + faal.fold,
     data = na.omit(aa_dat[,c("MW.da","vol","pol","H2Opho.34","polar.req","saaH2O","faal.fold")]),
     panel = panel.smooth,
     upper.panel =panel.cor )

Plot 1: Replication Higgs and Attwood Figure 2.8

Figure 1

This generates a graph showing the relationship between Amino Acid Volume and pI.

#sets up blank plot and labels
plot(polar.req  ~ freq, data = aa_dat,
     xlab = "pI",
     ylab = "Volume",
     main = "Amino Acid Volume vs. pI ",
     col = 0)

#tilda uses plot() to plot the data points
text(polar.req ~ freq, 
     labels = aa, 
     data = aa_dat, 
     col = 1:20)

Figure 2: HA Figure 2.8 with ggpubr

ggscatter(y = "vol",
x = "isoelec",
size = "MW.da",
color = "bulk",
data = aa_dat,
xlab = "Isoelectric Point",
ylab = "Volume")

Figure 3: Highly Correlated Data

Code generates a regression line with formula N = B0 +B1T. The eclipse represents data that is within certain standard deviations.

ggscatter(y = "saaH2O",
          x = "vol",
          add = "reg.line", # line of best fit
          ellipse = TRUE, # data ellipse
          cor.coef = TRUE, # correlation coef
          data = aa_dat,
          xlab = "Volume",
          ylab = "saaH2O")

## `geom_smooth()` using formula 'y ~ x'

Figure 4: Non-linear Relationship

ggscatter(y = "saaH2O",
          x = "polar.req",
          color = "bulk",
          size = "MW.da",
          add  = "loess",
          data = aa_dat,
          xlab = "Polar requirement",
          ylab = "Surface area accessible to water in an unfolded peptide")

## `geom_smooth()` using formula 'y ~ x'

Figure 5: Non-linear Relationship (Log Scale)

Log allows us to look at non-linear data from a linear perspective by changing the scale.

aa_dat$saaH2O_log <- log(aa_dat[,"saaH2O"])
aa_dat$polar.req_log <- log(aa_dat[,"polar.req"])

ggscatter(y = "saaH2O",
          x = "polar.req_log",
          add = "reg.line",
          size = "MW.da",
          color = "bulk",
          xlab = "Log10 polar requirement",
          ylab = "Log10 saaH2O",
          data = aa_dat)

## `geom_smooth()` using formula 'y ~ x'

Figure 6: 3D Scatterplot

This code generates a 3D scatterplot, showing the relationship between frequency, polar requirement, and isoelectric point.

scatterplot3d(x = aa_dat$polar.req,
             y = aa_dat$isoelec,
             z = aa_dat$freq,
             highlight.3d = TRUE,
             type = "h")

Principal Component Analysis

Principal Component Analysis (PCA) is a method of cluster data together based on their similarity to reduce the dimensions of large data sets by scaling them down. The authors describe it as a way to visualize multidimensional data, focusing on its important properties (Higgs and Attwood 2005 pg 34).

PCA plots were made using R’s packages and manually through distance matrices.

Data Prep

This code creates a modified version of the original data frame without the categorical variables.

aa_dat2 = data.frame(vol, bulk, pol, isoelec, H2Opho.34, H2Opho.35, saaH2O, faal.fold)

Figure 7: Principal components analysis - base R

Makes PCA plot using base R.

pca.out <- prcomp(aa_dat2, scale = TRUE)
biplot(pca.out)

Figure 8: Principal components analysis - vegan

Runs PCA with vegan. “Scale = TRUE” sets the scale based on the numeric data

rda.out <- vegan::rda(aa_dat2, scale = TRUE)

Extracts PCA scores as rda_scores

rda_scores <- scores(rda.out)

Creates biplot(PCA plot)

biplot(rda.out, display = "sites", col = 0,)
orditorp(rda.out, display = "sites", labels = aa_dat$aa, cex = 1.2)
ordihull(rda.out,
group = aa_dat$charge,
col = 1:7,
lty = 1:7,
lwd = c(3,6))

Some of the biggest differences between this plot and Higgs and Attwood’s Figure 2.10 is the orientation of the plot. In addition, the scaling of both x and y axes is different.

Cluster Analysis

Cluster analysis is used to group objects based on their similarity by analyzing them through a property. Object that are closer together form “clusters”, allowing us to deduce that they are similar to each other based on distance.

UPGMA is a type of method used for cluster analysis. Standing for unweighted pair group method with arithmetic mean, it find the lowest mean distance between two objects and groups them together. It repeats the process until all objects grouped together, which can be used to plot a tree based on distances.

Hierarchal Cluster Analysis

aa_dat2 = aa_dat[,-c(1,2,11,12,13,14,15,16,17)]

row.names(aa_dat2) = paste(aa_dat$aa,sep = ".")

dist_euc = dist(aa_dat2,method = "euclidean")

clust_euc = hclust(dist_euc,method = "average") #average implies UPGMA

dendro_euc = as.dendrogram(clust_euc)

plot(dendro_euc,horiz = T,
     nodePar = list(pch = c(1,NA),
                    cex = 0.5,
                    lab.cex = 1.2))

Distance Matrices show distance between objects based on their similarities. Higher numbers in the table imply that the objects aren’t very similar to each other while zero implies that the objects are identical.

Euclidean distance refers to difference between a two points on a 2D line. Trees are plotted with lines and the distance between two objects can be identified by the euclidean distances.

While this dendogram and Higgs and Attwood’s dendogram is quite similar, there are some differences between the two. Some of the clades are arranged differently due to differences in amino acid properties.