Introduction

This code examines properties of amino acids, such as volume, isoelectric point, polarity and others to find trends and correlations. It first creates a matrix to examine the correlations between different variables. Then, it plots different variables against each other to see trends. Next, it performs a PCA to reduce the dimensionality of the data and visualize the correlations between all variables. Fianlly, through cluster analysis, we can use the data to find relationship and trends between properties of amino acids to offer explanations for their behaviors in nature. ## Preliminaries

Packages

ggpubr and ggplot2 are packages that make it more intuitive to create good looking graphs. ggpubr provides functions for creating and customizing ggplot2-based plots.

library(ggplot2)        # creating plots
library(ggpubr)         # creating plots
library(vegan)          # Community Ecology data

## Loading required package: permute

## Loading required package: lattice

## This is vegan 2.5-6

library(scatterplot3d)  # creating 3D scatter plots

Build the dataframe

These data represent different properties of amino acids. They include measurements of the size (mass, volume, etc), qualities and measurements of polarity, and measurements regarding hydrophobicity. There are both numerical and categorical data.

These data are derived mostly from Higgs (2009) Table 1. They are similar to Higgs and Attwood 2005 but have extra columns. Nathan Brouwer added additional categorical variables based on online information.

# 1 letter code
aa        <-c('A','C','D','E','F','G','H','I','K','L','M','N',
              'P','Q','R','S','T','V','W','Y')


## molecular weight in dalton
MW.da     <-c(89,121,133,146,165,75,155,131,146,131,149,132,115,147,174,105,119,117,204,181)

## vol from van der Waals radii 
vol    <-c(67,86,91,109,135,48,118,124,135,124,124,96,90,
              114,148,73,93,105,163,141)

##  bulk – a measure of the shape of the side chain
bulk <-c(11.5,13.46,11.68,13.57,19.8,3.4,13.69,21.4,
              15.71,21.4,16.25,12.28,17.43,
              14.45,14.28,9.47,15.77,21.57,21.67,18.03)

## pol – a measure of the electric field strength around the molecule
pol  <-c(0,1.48,49.7,49.9,0.35,0,51.6,0.13,49.5,0.13,
              1.43,3.38,1.58,3.53,52,1.67,1.66,0.13,2.1,1.61)

## isoelec point
isoelec <-c(6,5.07,2.77,3.22,5.48,5.97,7.59,6.02,9.74,5.98,
                   5.74,5.41,6.3,5.65,10.76,5.68,6.16,5.96,5.89,5.66)

## 1st Hydrophobicity scale
H2Opho.34 <-c(1.8,2.5,-3.5,-3.5,2.8,-0.4,-3.2,4.5,-3.9,3.8,1.9,
                  -3.5,-1.6,-3.5,-4.5,-0.8,-0.7,4.2,-0.9,-1.3)

## 2nd Hydrophobicity scale
H2Opho.35 <-c(1.6,2,-9.2,-8.2,3.7,1,-3,3.1,-8.8,2.8,3.4,-4.8,
                  -0.2,-4.1,-12.3,0.6,1.2,2.6,1.9,-0.7)

## Surface area accessible to water in an unfolded peptide
saaH2O       <-c(113,140,151,183,218,85,194,182,211,180,204,158,
                 143,189,241,122,146,160,259,229)


## Fraction of accessible area lost when a protein folds
faal.fold    <-c(0.74,0.91,0.62,0.62,0.88,0.72,0.78,0.88,0.52,
                 0.85,0.85,0.63,0.64,0.62,0.64,0.66,0.7,0.86,0.85,0.76)



# Polar requirement 
polar.req    <-c(7,4.8,13,12.5,5,7.9,8.4,4.9,10.1,4.9,5.3,10,
                 6.6,8.6,9.1,7.5,6.6,5.6,5.2,5.4)

## relative frequency of occurrence
## "The frequencies column shows the mean 
### percentage of each amino acid in the protein sequences 
### of modern organisms"
freq        <-c(7.8,1.1,5.19,6.72,4.39,6.77,2.03,6.95,6.32,
                10.15,2.28,4.37,4.26,3.45,5.23,6.46,5.12,7.01,1.09,3.3)

##
# charges
## un = Un-charged
## neg = negative
## pos = positive
charge<-c('un','un','neg','neg','un','un','pos','un','pos',
          'un','un','un','un','un','pos','un','un','un','un','un')

## hydropathy
hydropathy<-c('hydrophobic','hydrophobic','hydrophilic','
              hydrophilic','hydrophobic','neutral','neutral',
              'hydrophobic','hydrophilic','hydrophobic','hydrophobic',
              'hydrophilic','neutral',
              'hydrophilic','hydrophilic','neutral','neutral',
              'hydrophobic','hydrophobic','neutral')

## vol
vol.cat<-c('verysmall','small','small','medium',
              'verylarge','verysmall','medium','large','large',
              'large','large','small','small','medium','large',
              'verysmall',
              'small','medium','verylarge','verylarge')

## pol
pol.cat<-c('nonpolar','nonpolar','polar','polar',
                'nonpolar','nonpolar','polar','nonpolar',
                'polar','nonpolar','nonpolar','polar','nonpolar','polar',
                'polar','polar','polar','nonpolar','nonpolar','polar')

## chemical
chemical<-c('aliphatic','sulfur','acidic','acidic','aromatic',
            'aliphatic','basic','aliphatic','basic','aliphatic','sulfur',
            'amide','aliphatic','amide','basic','hydroxyl','hydroxyl',
            'aliphatic','aromatic','aromatic')

Building data frame

aa_dat <- data.frame(aa,MW.da,vol,bulk,
pol,isoelec,H2Opho.34,H2Opho.35,
saaH2O,faal.fold, polar.req,freq,charge,hydropathy,vol.cat,pol.cat,chemical)

Raw data exploration

Creating the data matrix

cor_ <- round(cor(aa_dat[,-c(1,13:17)]),2)
diag(cor_) <- NA
cor_[upper.tri(cor_)] <- NA
cor_

##           MW.da   vol  bulk   pol isoelec H2Opho.34 H2Opho.35 saaH2O faal.fold
## MW.da        NA    NA    NA    NA      NA        NA        NA     NA        NA
## vol        0.93    NA    NA    NA      NA        NA        NA     NA        NA
## bulk       0.55  0.73    NA    NA      NA        NA        NA     NA        NA
## pol        0.29  0.24 -0.20    NA      NA        NA        NA     NA        NA
## isoelec    0.20  0.36  0.08  0.27      NA        NA        NA     NA        NA
## H2Opho.34 -0.27 -0.08  0.44 -0.67   -0.20        NA        NA     NA        NA
## H2Opho.35 -0.25 -0.16  0.32 -0.85   -0.26      0.85        NA     NA        NA
## saaH2O     0.97  0.99  0.64  0.29    0.35     -0.18     -0.23     NA        NA
## faal.fold  0.11  0.18  0.49 -0.53   -0.18      0.84      0.79   0.12        NA
## polar.req -0.05 -0.19 -0.53  0.76   -0.11     -0.79     -0.87  -0.11     -0.81
## freq      -0.52 -0.30 -0.04 -0.01    0.02      0.26     -0.02  -0.38     -0.18
##           polar.req freq
## MW.da            NA   NA
## vol              NA   NA
## bulk             NA   NA
## pol              NA   NA
## isoelec          NA   NA
## H2Opho.34        NA   NA
## H2Opho.35        NA   NA
## saaH2O           NA   NA
## faal.fold        NA   NA
## polar.req        NA   NA
## freq           0.14   NA

This code returns variables with the most positive and negative correlations and their locations in the matrix. saaH2O and volume have the most positive correlation, and polar.req and hydrophobe.35 have most negative correlation.

which(cor_ == max(cor_, na.rm = T), arr.ind = T)

##        row col
## saaH2O   8   2

Plot 0: Scatterplot matrix

This code builds and adds each scatterplot and correlation coefficient to the matrix.

panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor,...)
{
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(0, 1, 0, 1))
    r <- abs(cor(x, y))
    txt <- format(c(r, 0.123456789), digits = digits)[1]
    txt <- paste0(prefix, txt)
    if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
    text(0.5, 0.5, txt, cex = cex.cor * r)
}

plot(~MW.da+vol+pol+H2Opho.34+polar.req+isoelec,
     panel = panel.smooth,
     upper.panel = panel.cor)

Strong correlations exist between MW.da and vol, polar.req and H2Oph.34, and pol and polar.req. There are weak correlations between vol and H2Opho.34, MW.da and polar.req, and isoelec and MW.da. There is a nonlinear correlation between MW.da and H2Opho.34.

Plot 1: Replication Higgs and Attwood Figure 2.8

This is a plot of amino acid volume vs. pI which are thought to be important in protein folding, with the different points representing amino acids.

plot(vol ~ isoelec, data = aa_dat,  # main plotting function
     xlab = "pI",
     ylab = "Volume (Å)",
     main = "Amino Acid pI vs Volume",
     col = 0)
text(vol ~ isoelec, # adds text label
     labels = aa, 
     data = aa_dat, 
     col = 1:20)

Figure 2: HA Figure 2.8 with ggpubr

ggscatter(y = "vol", x = "isoelec",
          size = "bulk",
          color = "pol",
          data = aa_dat,
          xlab = "pI",
ylab = "Volume")

Figure 3: Highly correlated data

The regression line follows the basic form y=mx+b and best fits the data, having the smallest overall distance from the line to the data points.

The data ellipse shows the range in which the most values lie on average in a bivariate data set, with variance in two directions.

ggscatter(y = "vol",
          x = "MW.da",
          add = "reg.line", # line of best fit
          ellipse = TRUE, # data ellipse
          cor.coef = TRUE, # correlation coef
          data = aa_dat,
          xlab = "Volume",
          ylab = "Molecular Weight (daltons)")

## `geom_smooth()` using formula 'y ~ x'

Figure 4: Non-linear relationship

ggscatter(y = "MW.da",
          x = "vol",
          size = "saaH2O",
          color = "faal.fold",
          data = aa_dat,
          xlab = "Volume",
          ylab = "Molecular Weight (daltons)",
          add = "loess")

## `geom_smooth()` using formula 'y ~ x'

Figure 5: Non-linear relationship on LOG scale

The log transformation transform skewed data to approximately conform to normality by taking the log of the values of the variables.

MW.da_log <- log(MW.da)
vol_log <- log(vol)

aa_dat_log <- data.frame(MW.da_log,vol_log,saaH2O,faal.fold)

ggscatter(y = "MW.da_log",
          x = "vol_log",
          size = "saaH2O",
          color = "faal.fold",
          data = aa_dat_log,
          xlab = "Volume",
          ylab = "Molecular Weight (daltons)",
          cor.coef = TRUE)

Figure 6: 3D Scatterplot

There appears to be a positive correlation between volume and MW, surface area and MW, and surface area and volume. There does not appear to be a very clear correlation between faal.fold and the other variables.

scatterplot3d(x = saaH2O,
              y = vol,
              z = H2Opho.34,
              highlight.3d = TRUE,
              type="h")

Principal component analysis

Principal component analysis (PCA) is a tool to reduce the dimensions of data, and to visualize the similarities between things while filtering out noise. This is done by transforming the data points “first by scaling them and shifting them to the origin, and then by rotating them in such a way that the points are spread out as much as possible”(Higgs and Attwood 2005, pg 26).

The 2 basic ways we made PCA plots in this class are with the base R prcomp() and biplot() functions, which makes a 2D PCA plot. This is a simple, but not very flexible, way to make pca plots. Alternatively, we also use the rda() function in the R package vegan which does PCA and has other nice features not in the base R function.

Data prep

This data frame drops the categorical variables from the original data frame, leaving only the numeric variables like in the original Higgs and Attwood table.

aa_dat2 <- data.frame(MW.da,vol,bulk,
pol,isoelec,H2Opho.34,H2Opho.35,
saaH2O,faal.fold, polar.req,freq)

Figure 7: Principal components analysis - base R

pca.out <- prcomp(aa_dat2, scale = TRUE)
biplot(pca.out)

Figure 8: Principal components analysis - vegan

scale = TRUE scales species to unit variance (like correlations).

rda.out <- vegan::rda(aa_dat2, scale = TRUE)
rda_scores <- scores(rda.out)

TODO: The plot in Higgs and Atwood has similar groupings such as F with M, V, L, and I, and R with K. However, this plot is slightly more compressed as it is scaled differently. Also, the clustering of categorical polarity data (polar vs nonpolar) is represented in this plot, showing some overlap as in Higgs and Atwood, but a fairly clear separation.

biplot(rda.out, display = "sites", col = 0,)
orditorp(rda.out, display = "sites", labels = aa_dat$aa, cex = 1.2)
ordihull(rda.out,
group = aa_dat$pol.cat, col = 1:7,
lty = 1:7,
lwd = c(3,6))

Cluster analysis

Cluster analysis is used to place data points into groups or clusters, based on the data.The objects in a cluster tend to be similar to each other in some sense, and objects in different clusters tend to be dissimilar.

To build a UPGMA tree, you first construct a distance matrix and cluster a pair taxa by shortest distance. Then, recalculate a new average distance with the new cluster and other taxa, and make a new distance matrix. Repeat this until only two clusters remain.

Hierarchical cluster analysis

A distance matrix is a two-dimensional array filled with the distances between the elements of a set. , The distances are calculated pairwise.

Euclidean distance is between two points in Euclidean space as a number, the length of a line segment between the two points. It can be calculated from the Cartesian coordinates of the points using the Pythagorean theorem.

Compared to the cluster diagram produced by Higgs and Atwood (2005) in Plate 2.2, the one produced here is similar in that it pairs I with L, P with T, and a few other equal pairings. It differs with many of the pairings however. Higgs and Atwood used CLUTO to construct the cluster analysis, as well as implementing a heat map with a second tree of the variables.

row.names(aa_dat2) <- paste(aa_dat$aa,1:nrow(aa_dat),sep = ".")

dist_euc <- dist(aa_dat2, 
                 method = "euclidean")
clust_euc <- hclust(dist_euc, 
       method = "average") # specifies UPGMA method

plot(clust_euc)

Replicating Higgs and Attwood’s analysis on the properties of amino acids

Brandon Duncan, with contributions from Nathan Brouwer

11/12/2020