Replicating Higgs and Attwood’s analysis on the properties of amino acids

Introduction

Below is code that will first create a data-frame of variables measuring characteristics of amino acids, and then utilize several data visualization methods to determine relationships between these variables. Highly correlated data will be visualized, a 3D plot will be used to show multiple variables and a PCA analysis will show groupings of data by variance in principal components, among other analyses. These analyses are biologically important because amino acids are the primary structure of proteins, the macro-molecules responsible for almost all biological functions. Determing relationships between characteristics of amino acids can help to understand why proteins fold in the way the do or why one protein with a high content of a certain amino acid binds to one receptor but not another. The implications are endless with proper analysis and adequate data.

Preliminaries

Packages

ggpubr expands on the functionalities of ggplot2 but makes the package easier to use. It also provides functions that create ‘publication ready’ graphics.

library(ggplot2)               #Contains functions for plotting/graphing data that are easier, better than base graphics

## Warning: package 'ggplot2' was built under R version 4.0.2

library(ggpubr)                #Makes it easier to implement ggplot

## Warning: package 'ggpubr' was built under R version 4.0.2

library(vegan)                 #Contains functions for plotting PCA

## Warning: package 'vegan' was built under R version 4.0.2

## Loading required package: permute

## Warning: package 'permute' was built under R version 4.0.2

## Loading required package: lattice

## This is vegan 2.5-6

library(scatterplot3d)         #Contains functions for creating a 3d scatterplot

## Warning: package 'scatterplot3d' was built under R version 4.0.2

library(permute)               #Allows us to create spatial grid designs as well as time series'
library(lattice)               #Contains functions useful for creating graphs to display multivariate relationshipsa

Build the dataframe

The data in each vector represent the values of a characteristic of amino acids, for each acid. For example, MW.da lists the molecular weight in daltons of each amino acid (aa). The volume, bulk, polarity as well as many other variables are also recorded. Listing each value in the order that the acids are listed in the vector aa, will allow use to combine all the vectors into a dataframe, with each value corresponding to the correct aa. - Note - These data are derived mostly from Higgs (2009) Table 1. They are similar to Higgs and Attwood 2005 but have extra columns. Nathan Brouwer added additional categorical varibles based on online information.

Make the vectors.

# 1 letter code
aa <-c('A','C','D','E','F','G','H','I','K','L','M','N', 'P','Q','R','S','T','V','W','Y')
## molecular weight in dalton
MW.da <-c(89,121,133,146,165,75,155,131,146,131,149,132,115,147,174,105,119,117,204,181)

 ## vol from van der Waals radii
vol <-c(67,86,91,109,135,48,118,124,135,124,124,96,90, 114,148,73,93,105,163,141)

##  bulk – a measure of the shape of the side chain
bulk <-c(11.5,13.46,11.68,13.57,19.8,3.4,13.69,21.4, 15.71,21.4,16.25,12.28,17.43,14.45,14.28,9.47,15.77,21.57,21.67,18.03)

## pol – a measure of the electric field strength around the molecule
pol <-c(0,1.48,49.7,49.9,0.35,0,51.6,0.13,49.5,0.13, 1.43,3.38,1.58,3.53,52,1.67,1.66,0.13,2.1,1.61)

## isoelec point
isoelec <-c(6,5.07,2.77,3.22,5.48,5.97,7.59,6.02,9.74,5.98, 5.74,5.41,6.3,5.65,10.76,5.68,6.16,5.96,5.89,5.66)

## 1st Hydrophobicity scale
H2Opho.34 <-c(1.8,2.5,-3.5,-3.5,2.8,-0.4,-3.2,4.5,-3.9,3.8,1.9, -3.5,-1.6,-3.5,-4.5,-0.8,-0.7,4.2,-0.9,-1.3)

## 2nd Hydrophobicity scale
H2Opho.35 <-c(1.6,2,-9.2,-8.2,3.7,1,-3,3.1,-8.8,2.8,3.4,-4.8, -0.2,-4.1,-12.3,0.6,1.2,2.6,1.9,-0.7)

## Surface area accessible to water in an unfolded peptide
saaH2O <-c(113,140,151,183,218,85,194,182,211,180,204,158, 143,189,241,122,146,160,259,229)

## Fraction of accessible area lost when a protein folds
faal.fold <-c(0.74,0.91,0.62,0.62,0.88,0.72,0.78,0.88,0.52, 0.85,0.85,0.63,0.64,0.62,0.64,0.66,0.7,0.86,0.85,0.76)

# Polar requirement
polar.req <-c(7,4.8,13,12.5,5,7.9,8.4,4.9,10.1,4.9,5.3,10, 6.6,8.6,9.1,7.5,6.6,5.6,5.2,5.4)

## Relative frequency of occurrence - shows mean percentage of each amino acid in the protein sequences of modern organisms"
freq <-c(7.8,1.1,5.19,6.72,4.39,6.77,2.03,6.95,6.32,
                10.15,2.28,4.37,4.26,3.45,5.23,6.46,5.12,7.01,1.09,3.3)

## un = Un-charged
## neg = negative
## pos = positive 
charge <- c('un','un','neg','neg','un','un','pos','un','pos','un','un','un','un','un','pos','un','un','un','un','un')

## hydropathy
hydropathy<-c('hydrophobic','hydrophobic','hydrophilic',' hydrophilic','hydrophobic','neutral','neutral',
              'hydrophobic','hydrophilic','hydrophobic','hydrophobic', 'hydrophilic','neutral',
              'hydrophilic','hydrophilic','neutral','neutral', 'hydrophobic','hydrophobic','neutral')

## Volume
vol.cat<-c('verysmall','small','small','medium', 'verylarge','verysmall','medium','large','large', 'large','large',
           'small','small','medium','large', 'verysmall', 'small','medium','verylarge','verylarge')

## Polarity
pol.cat<-c('nonpolar','nonpolar','polar','polar', 'nonpolar','nonpolar','polar','nonpolar', 'polar','nonpolar',
           'nonpolar','polar','nonpolar','polar', 'polar','polar','polar','nonpolar','nonpolar','polar')

## Chemical
chemical<-c('aliphatic','sulfur','acidic','acidic','aromatic', 'aliphatic','basic','aliphatic','basic','aliphatic','sulfur',                            'amide','aliphatic','amide','basic','hydroxyl','hydroxyl', 'aliphatic','aromatic','aromatic')

Build the Dataframe

aa_dat <- data.frame(aa, MW.da, vol, bulk, pol, isoelec, H2Opho.34, H2Opho.35, saaH2O, faal.fold, polar.req, freq)

Raw data exploration

Distance Matrix/Correlation Matrix

cor_ <- round(cor(aa_dat[,-c(1,13:17)]),2) 
diag(cor_) <- NA
cor_[upper.tri(cor_)] <- NA
cor_

##           MW.da   vol  bulk   pol isoelec H2Opho.34 H2Opho.35 saaH2O faal.fold
## MW.da        NA    NA    NA    NA      NA        NA        NA     NA        NA
## vol        0.93    NA    NA    NA      NA        NA        NA     NA        NA
## bulk       0.55  0.73    NA    NA      NA        NA        NA     NA        NA
## pol        0.29  0.24 -0.20    NA      NA        NA        NA     NA        NA
## isoelec    0.20  0.36  0.08  0.27      NA        NA        NA     NA        NA
## H2Opho.34 -0.27 -0.08  0.44 -0.67   -0.20        NA        NA     NA        NA
## H2Opho.35 -0.25 -0.16  0.32 -0.85   -0.26      0.85        NA     NA        NA
## saaH2O     0.97  0.99  0.64  0.29    0.35     -0.18     -0.23     NA        NA
## faal.fold  0.11  0.18  0.49 -0.53   -0.18      0.84      0.79   0.12        NA
## polar.req -0.05 -0.19 -0.53  0.76   -0.11     -0.79     -0.87  -0.11     -0.81
## freq      -0.52 -0.30 -0.04 -0.01    0.02      0.26     -0.02  -0.38     -0.18
##           polar.req freq
## MW.da            NA   NA
## vol              NA   NA
## bulk             NA   NA
## pol              NA   NA
## isoelec          NA   NA
## H2Opho.34        NA   NA
## H2Opho.35        NA   NA
## saaH2O           NA   NA
## faal.fold        NA   NA
## polar.req        NA   NA
## freq           0.14   NA

The code below determines which variables have the most positive correlation and which have the most negative correlation. - The variables with the most positive correlation are surface area accessible to water (saaH2O) and volume (vol). This makes logical sense, since volume and surface area are positively related. - The variables with the most negative correlation are polarity requirement (polar req) and Hydrophobicity (H20pho.35).

#Finding the variables that have the most positive correlation
which(cor_ == max(cor_, na.rm = T), arr.ind = T)

##        row col
## saaH2O   8   2

cor_[which(cor_ == max(cor_, na.rm = T), arr.ind = T) ]

## [1] 0.99

#Finding the variables that have the most negative correlation
which(cor_ == min(cor_, na.rm = T), arr.ind = T)

##           row col
## polar.req  10   7

cor_[which(cor_ == min(cor_, na.rm = T), arr.ind = T)]

## [1] -0.87

Plot 0: Scatterplot matrix

This function panel.cor allows us to create an upper half of a distance matrix that displays the correlation between the two variables instead of the scatter plot. Further, it displays the correlation in a greater font it the correlation itself is larger.

panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor,...) {
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(0, 1, 0, 1))
  r <- abs(cor(x, y))
  txt <- format(c(r, 0.123456789), digits = digits)[1] 
  txt <- paste0(prefix, txt)
  if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
  text(0.5, 0.5, txt, cex = cex.cor * r) }

plot(~MW.da+vol+pol+H2Opho.34+H2Opho.35+polar.req+bulk+isoelec,
     data = na.omit(aa_dat[, c("MW.da", "vol", "pol", "H2Opho.34", "H2Opho.35", "polar.req", "bulk", "isoelec")]),
     panel = panel.smooth,
     upper.panel = panel.cor)

It seems that Vol and MW.da have a very strong relationship as well as H2Opho.35 and pol and H2Opho.34. The correlations are easily determined because the upper half of the matrix displays them in large text.

Plot 1: Replication Higgs & Atwood Figure 2.8

This plot represents the relationship between the volume of an amino acid and its isoelectric point. We can see that there seems to be a strong, relatively vertical relationship where the volume does nat change, but most amino acids have an isoelectric point around 6.

ggscatter(y = "vol", x = "isoelec",            #Choose which variables to add to which axis
          data = aa_dat,
          ylab = "Volume",
          xlab = "Isoelectric Point",          #ylab and xlab arguments change the labels of the axes
          main = "Volume vs. Isoelectric point",
          col = 1:20)

## Warning in if (color %in% names(data) & is.null(add.params$color))
## add.params$color <- color: the condition has length > 1 and only the first
## element will be used

Plot 2: GGpubr version of Higgs & Atwood Figure 2.8

 ggscatter(y = "vol", x = "isoelec",
          size = "bulk",
          color = "freq",
          data = aa_dat,
          xlab = "Isoelectric Point",
          ylab = "Volume")

Plot 3: Highly Correlated Variables

The general mathematical formula for a regression line is y = b0 - b1x. b0 in this case is the y intercept and b1 is the slope. It is simply a different notation for y = mx+b.
A data ellipse is used to show where a certain percentage of the data lies. In the below example, no confidence level is specified, however if 95% was specified the ellipse would encircel 95% of the data.

ggscatter(y = "saaH2O", x = "vol",
          size = "polar.req",
          add = "reg.line",                        # line of best fit ellipse = TRUE, adds line of best fit
          ellipse = TRUE,                          # adds the ellipse to the graph
          cor.coef = TRUE,                         # adds corr. coeff to the graph
          data = aa_dat,
          xlab = "Volume from Van Der Waals Radii",
          ylab = "Surface Area accesible to water, in unfolded peptide")

## `geom_smooth()` using formula 'y ~ x'

Figure 4: Non-linear relationship

ggscatter(data = aa_dat, y = "freq", x = "bulk",
          size = "vol",
          color = "pol",
          xlab = "A Measure of the Shape of the Side Chain",
          ylab = "Relative Frequency of Occurrence",
          main = "Frequency vs. Bulk") + geom_smooth(method=loess)

## `geom_smooth()` using formula 'y ~ x'

Figure 5: Non-linear relationship on LOG scale

Below, the same data as in Figure 4 is plotted, with the exception that it is transformed to a natural log scale. This method is used to make nonlinear data into linear data. In this case, the method does not work too well, however we can see there is a different grouping of data points.

aa_dat$log_freq <- log(aa_dat[,"freq"])
aa_dat$log_bulk <- log(aa_dat[,"bulk"])
ggscatter(data = aa_dat, y = "log_freq", x = "log_bulk",
          size = "vol",
          color = "pol",
          add = "reg.line",
          xlab = "A Measure of the Shape of the Side Chain",
          ylab = "Relative Frequency of Occurrence",
          main = "Frequency vs. Bulk")

## `geom_smooth()` using formula 'y ~ x'

Figure 6: 3D Scatterplot

Using a 3D scatterplot allows us to examine the relationships between three variables all at once. Here we see that Volume, Bulk, and Accessible surface area all have relatively strong, positive correlations with each other.

scatterplot3d(x = aa_dat$vol,
              y = aa_dat$bulk,
              z = aa_dat$saaH2O,
              highlight.3d = TRUE,
              type = "h",
              ylab = "Bulk", xlab = "Volume", zlab = "Accessible Surface Area")

Principal component analysis

Principal Component Analysis (PCA) is a method used to reduce high dimensional data to a few principal components that carry the majority of variance in the data. According to Higgs and Atwood, PCA offers “a way of combining the information from all eight properties into a two dimensional graph.” PCA is useful because visualization of complex relationships is possible.
The two methods we have used in this class to create a PCA graph are with base R graphics and with the Vegan package. Base R makes use of the function prcomp() while the vegan package gives us the ability to use the function rda(). The benefit of the vegan package is that it distinguishes between groups of variables.

Data prep

To make this new dataframe we initialize a new dataframe to the old one (aa_dat), but then index by a vector of only the columns that we want to include (the numeric variables) in our new dataframe.

aa_dat2 <- aa_dat[,c("vol", "bulk", "pol", "isoelec", "H2Opho.34", "H2Opho.35", "saaH2O", "faal.fold", "polar.req", "freq")]
aa_dat2

##    vol  bulk   pol isoelec H2Opho.34 H2Opho.35 saaH2O faal.fold polar.req  freq
## 1   67 11.50  0.00    6.00       1.8       1.6    113      0.74       7.0  7.80
## 2   86 13.46  1.48    5.07       2.5       2.0    140      0.91       4.8  1.10
## 3   91 11.68 49.70    2.77      -3.5      -9.2    151      0.62      13.0  5.19
## 4  109 13.57 49.90    3.22      -3.5      -8.2    183      0.62      12.5  6.72
## 5  135 19.80  0.35    5.48       2.8       3.7    218      0.88       5.0  4.39
## 6   48  3.40  0.00    5.97      -0.4       1.0     85      0.72       7.9  6.77
## 7  118 13.69 51.60    7.59      -3.2      -3.0    194      0.78       8.4  2.03
## 8  124 21.40  0.13    6.02       4.5       3.1    182      0.88       4.9  6.95
## 9  135 15.71 49.50    9.74      -3.9      -8.8    211      0.52      10.1  6.32
## 10 124 21.40  0.13    5.98       3.8       2.8    180      0.85       4.9 10.15
## 11 124 16.25  1.43    5.74       1.9       3.4    204      0.85       5.3  2.28
## 12  96 12.28  3.38    5.41      -3.5      -4.8    158      0.63      10.0  4.37
## 13  90 17.43  1.58    6.30      -1.6      -0.2    143      0.64       6.6  4.26
## 14 114 14.45  3.53    5.65      -3.5      -4.1    189      0.62       8.6  3.45
## 15 148 14.28 52.00   10.76      -4.5     -12.3    241      0.64       9.1  5.23
## 16  73  9.47  1.67    5.68      -0.8       0.6    122      0.66       7.5  6.46
## 17  93 15.77  1.66    6.16      -0.7       1.2    146      0.70       6.6  5.12
## 18 105 21.57  0.13    5.96       4.2       2.6    160      0.86       5.6  7.01
## 19 163 21.67  2.10    5.89      -0.9       1.9    259      0.85       5.2  1.09
## 20 141 18.03  1.61    5.66      -1.3      -0.7    229      0.76       5.4  3.30

Figure 7: Principal components analysis - base R

Run a PCA and create a PCA plot using base R command.

pca1 <- prcomp(aa_dat2[,-c(9,10)], scale = TRUE)
biplot(pca1)

Figure 8: Principal components analysis - vegan

For another example see https://rpubs.com/brouwern/veganpca

Run the PCA using vegan. Call the output rda.out

rda.out <- vegan::rda(aa_dat2[,-c(9:10)], scale = TRUE) #scale = TRUE normalizes the data so mean and std. dev are 1 before doing PCA

Extract what are known as the PCA score - don’t worry about what this means. Call the output rda_scores

rda_scores <- scores(rda.out)

The most obvious difference between this plot and Figure 2.10 from Higgs & Atwood 2005 are the axis differences, tick marks and shading of the graph. Beyond that, there are slight differences in the distribution of the data, most likely due to this difference in axis scales.

biplot(rda.out, display = "sites", col = 0,)
orditorp(rda.out, display = "sites", labels = aa_dat$aa, cex = 1.2)

ordihull(rda.out, group = aa_dat$pol.cat, col = 1:7, lty = 1:7, lwd = c(3,6))

Cluster analysis

The purpose of cluster analysis is to group data points based on similarity or most often distance. The idea is that data that are closer, with the smallest distance, are the most closely related. A final cluster analysis can then help determine the relative relatedness between groups and determine ancestral relationships. This can have implications for predicting future changes as well as answering present questions about the interaction between groups.
UPGMA works by first finding the two groups that are the closest to each other via a distance matrix. It then considers this new joint-group as a single group and determines the difference to all other groups. This works well to create a phylogenetic tree, where the difference between groups can be used to orient the physical structure.

Hierarchical cluster analysis

row_names <- paste(aa, 1:nrow(aa_dat2))
row_names

##  [1] "A 1"  "C 2"  "D 3"  "E 4"  "F 5"  "G 6"  "H 7"  "I 8"  "K 9"  "L 10"
## [11] "M 11" "N 12" "P 13" "Q 14" "R 15" "S 16" "T 17" "V 18" "W 19" "Y 20"

dist_euc <- dist(aa_dat2, method = "euclidean")

A distance matrix is a diagonal matrix that has the pairwise differences between sequences as values. It is a diagonal plot because the flipping which variable is in the row and which is in the column does not change the distance between the two variables.
The euclidean distance is the straigth distance between two elements on a coordinate plane. While the Manhattan distance includes the sum of the x and y distance, the euclidean distance can be thought of as the hypotenuse distance, or the distance of the straight line connecting the two points.

clustFinal <- hclust(dist_euc, method = "average")  # method = "average" is setting the method to UPGMA
plot(as.dendrogram(clustFinal))

While Higgs & Atwood include a heat map within their cluster analysis, here there is no such addition. The layout of each cluster diagram is also different.