Introduction

In this script we are recreating Higgs and Attwood’s analysis on the properties of amino acids. We are using different graphs and methods of data analysis, such as PCA and hierarchical cluster analysis, to analyze data of amino acids. The different graphs and methods of analyzing data allow us to see the relationships between amino acids and their properties.

Preliminaries

ggpubr is a wrapper method for ggplot2. It allows for the creation of ggplot2-based graphs for people with non-advanced prgrammings backgrounds.

Packages

library(ggplot2)      # makes nice graphs 
library(ggpubr)        # makes ggplot2 easier
library(vegan)         # PCA has functions doing/plotting PCA

## Loading required package: permute

## Loading required package: lattice

## This is vegan 2.5-6

library(scatterplot3d) # for creating 3D scatterplots

Build the dataframe

These data are derived mostly from Higgs (2009) Table 1. They are similar to Higgs and Attwood 2005 but have extra columns. Nathan Brouwer added additional categorical variables based on online information.

Making the vectors

# 1 letter code
aa        <-c('A','C','D','E','F','G','H','I','K','L','M','N',
              'P','Q','R','S','T','V','W','Y')


## molecular weight in dalton
MW.da     <-c(89,121,133,146,165,75,155,131,146,131,149,132,115,147,174,105,119,117,204,181)

## vol from van der Waals radii 
vol    <-c(67,86,91,109,135,48,118,124,135,124,124,96,90,
              114,148,73,93,105,163,141)

##  bulk – a measure of the shape of the side chain
bulk <-c(11.5,13.46,11.68,13.57,19.8,3.4,13.69,21.4,
              15.71,21.4,16.25,12.28,17.43,
              14.45,14.28,9.47,15.77,21.57,21.67,18.03)

## pol – a measure of the electric field strength around the molecule
pol  <-c(0,1.48,49.7,49.9,0.35,0,51.6,0.13,49.5,0.13,
              1.43,3.38,1.58,3.53,52,1.67,1.66,0.13,2.1,1.61)

## isoelec point
isoelec <-c(6,5.07,2.77,3.22,5.48,5.97,7.59,6.02,9.74,5.98,
                   5.74,5.41,6.3,5.65,10.76,5.68,6.16,5.96,5.89,5.66)

## 1st Hydrophobicity scale
H2Opho.34 <-c(1.8,2.5,-3.5,-3.5,2.8,-0.4,-3.2,4.5,-3.9,3.8,1.9,
                  -3.5,-1.6,-3.5,-4.5,-0.8,-0.7,4.2,-0.9,-1.3)

## 2nd Hydrophobicity scale
H2Opho.35 <-c(1.6,2,-9.2,-8.2,3.7,1,-3,3.1,-8.8,2.8,3.4,-4.8,
                  -0.2,-4.1,-12.3,0.6,1.2,2.6,1.9,-0.7)

## Surface area accessible to water in an unfolded peptide
saaH2O       <-c(113,140,151,183,218,85,194,182,211,180,204,158,
                 143,189,241,122,146,160,259,229)


## Fraction of accessible area lost when a protein folds
faal.fold    <-c(0.74,0.91,0.62,0.62,0.88,0.72,0.78,0.88,0.52,
                 0.85,0.85,0.63,0.64,0.62,0.64,0.66,0.7,0.86,0.85,0.76)



# Polar requirement 
polar.req    <-c(7,4.8,13,12.5,5,7.9,8.4,4.9,10.1,4.9,5.3,10,
                 6.6,8.6,9.1,7.5,6.6,5.6,5.2,5.4)

## relative frequency of occurance
## "The frequencies column shows the mean 
### percentage of each amino acid in the protein sequences 
### of modern organisms"
freq        <-c(7.8,1.1,5.19,6.72,4.39,6.77,2.03,6.95,6.32,
                10.15,2.28,4.37,4.26,3.45,5.23,6.46,5.12,7.01,1.09,3.3)

##
# charges
## un = Un-charged
## neg = negative
## pos = positive
charge<-c('un','un','neg','neg','un','un','pos','un','pos',
          'un','un','un','un','un','pos','un','un','un','un','un')

## hydropathy
hydropathy<-c('hydrophobic','hydrophobic','hydrophilic','
              hydrophilic','hydrophobic','neutral','neutral',
              'hydrophobic','hydrophilic','hydrophobic','hydrophobic',
              'hydrophilic','neutral',
              'hydrophilic','hydrophilic','neutral','neutral',
              'hydrophobic','hydrophobic','neutral')

## vol
vol.cat<-c('verysmall','small','small','medium',
              'verylarge','verysmall','medium','large','large',
              'large','large','small','small','medium','large',
              'verysmall',
              'small','medium','verylarge','verylarge')

## pol
pol.cat<-c('nonpolar','nonpolar','polar','polar',
                'nonpolar','nonpolar','polar','nonpolar',
                'polar','nonpolar','nonpolar','polar','nonpolar','polar',
                'polar','polar','polar','nonpolar','nonpolar','polar')

## chemical
chemical<-c('aliphatic','sulfur','acidic','acidic','aromatic',
            'aliphatic','basic','aliphatic','basic','aliphatic','sulfur',
            'amide','aliphatic','amide','basic','hydroxyl','hydroxyl',
            'aliphatic','aromatic','aromatic')

Building the dataframes

This makes a dataframe with all of these vectors as columns. The vectors were created above

### Build dataframe
aa_dat <- data.frame(aa, 
                     MW.da,vol,
                     bulk, pol, 
                     isoelec,H2Opho.34, H2Opho.35,
                     saaH2O, faal.fold, polar.req,
                     freq, charge, hydropathy,
                     vol.cat, pol.cat, chemical)
aa_dat

##    aa MW.da vol  bulk   pol isoelec H2Opho.34 H2Opho.35 saaH2O faal.fold
## 1   A    89  67 11.50  0.00    6.00       1.8       1.6    113      0.74
## 2   C   121  86 13.46  1.48    5.07       2.5       2.0    140      0.91
## 3   D   133  91 11.68 49.70    2.77      -3.5      -9.2    151      0.62
## 4   E   146 109 13.57 49.90    3.22      -3.5      -8.2    183      0.62
## 5   F   165 135 19.80  0.35    5.48       2.8       3.7    218      0.88
## 6   G    75  48  3.40  0.00    5.97      -0.4       1.0     85      0.72
## 7   H   155 118 13.69 51.60    7.59      -3.2      -3.0    194      0.78
## 8   I   131 124 21.40  0.13    6.02       4.5       3.1    182      0.88
## 9   K   146 135 15.71 49.50    9.74      -3.9      -8.8    211      0.52
## 10  L   131 124 21.40  0.13    5.98       3.8       2.8    180      0.85
## 11  M   149 124 16.25  1.43    5.74       1.9       3.4    204      0.85
## 12  N   132  96 12.28  3.38    5.41      -3.5      -4.8    158      0.63
## 13  P   115  90 17.43  1.58    6.30      -1.6      -0.2    143      0.64
## 14  Q   147 114 14.45  3.53    5.65      -3.5      -4.1    189      0.62
## 15  R   174 148 14.28 52.00   10.76      -4.5     -12.3    241      0.64
## 16  S   105  73  9.47  1.67    5.68      -0.8       0.6    122      0.66
## 17  T   119  93 15.77  1.66    6.16      -0.7       1.2    146      0.70
## 18  V   117 105 21.57  0.13    5.96       4.2       2.6    160      0.86
## 19  W   204 163 21.67  2.10    5.89      -0.9       1.9    259      0.85
## 20  Y   181 141 18.03  1.61    5.66      -1.3      -0.7    229      0.76
##    polar.req  freq charge                  hydropathy   vol.cat  pol.cat
## 1        7.0  7.80     un                 hydrophobic verysmall nonpolar
## 2        4.8  1.10     un                 hydrophobic     small nonpolar
## 3       13.0  5.19    neg                 hydrophilic     small    polar
## 4       12.5  6.72    neg \n              hydrophilic    medium    polar
## 5        5.0  4.39     un                 hydrophobic verylarge nonpolar
## 6        7.9  6.77     un                     neutral verysmall nonpolar
## 7        8.4  2.03    pos                     neutral    medium    polar
## 8        4.9  6.95     un                 hydrophobic     large nonpolar
## 9       10.1  6.32    pos                 hydrophilic     large    polar
## 10       4.9 10.15     un                 hydrophobic     large nonpolar
## 11       5.3  2.28     un                 hydrophobic     large nonpolar
## 12      10.0  4.37     un                 hydrophilic     small    polar
## 13       6.6  4.26     un                     neutral     small nonpolar
## 14       8.6  3.45     un                 hydrophilic    medium    polar
## 15       9.1  5.23    pos                 hydrophilic     large    polar
## 16       7.5  6.46     un                     neutral verysmall    polar
## 17       6.6  5.12     un                     neutral     small    polar
## 18       5.6  7.01     un                 hydrophobic    medium nonpolar
## 19       5.2  1.09     un                 hydrophobic verylarge nonpolar
## 20       5.4  3.30     un                     neutral verylarge    polar
##     chemical
## 1  aliphatic
## 2     sulfur
## 3     acidic
## 4     acidic
## 5   aromatic
## 6  aliphatic
## 7      basic
## 8  aliphatic
## 9      basic
## 10 aliphatic
## 11    sulfur
## 12     amide
## 13 aliphatic
## 14     amide
## 15     basic
## 16  hydroxyl
## 17  hydroxyl
## 18 aliphatic
## 19  aromatic
## 20  aromatic

Raw data exploration

Symmetric matrix

This creates a symmetric matrix and puts coefficients in the dataframe. It also puts NA on the diagonal and for repeating data

cor_ <- round(cor(aa_dat[,-c(1,13:17)]),2)
diag(cor_) <- NA
cor_[upper.tri(cor_)] <- NA
cor_

##           MW.da   vol  bulk   pol isoelec H2Opho.34 H2Opho.35 saaH2O faal.fold
## MW.da        NA    NA    NA    NA      NA        NA        NA     NA        NA
## vol        0.93    NA    NA    NA      NA        NA        NA     NA        NA
## bulk       0.55  0.73    NA    NA      NA        NA        NA     NA        NA
## pol        0.29  0.24 -0.20    NA      NA        NA        NA     NA        NA
## isoelec    0.20  0.36  0.08  0.27      NA        NA        NA     NA        NA
## H2Opho.34 -0.27 -0.08  0.44 -0.67   -0.20        NA        NA     NA        NA
## H2Opho.35 -0.25 -0.16  0.32 -0.85   -0.26      0.85        NA     NA        NA
## saaH2O     0.97  0.99  0.64  0.29    0.35     -0.18     -0.23     NA        NA
## faal.fold  0.11  0.18  0.49 -0.53   -0.18      0.84      0.79   0.12        NA
## polar.req -0.05 -0.19 -0.53  0.76   -0.11     -0.79     -0.87  -0.11     -0.81
## freq      -0.52 -0.30 -0.04 -0.01    0.02      0.26     -0.02  -0.38     -0.18
##           polar.req freq
## MW.da            NA   NA
## vol              NA   NA
## bulk             NA   NA
## pol              NA   NA
## isoelec          NA   NA
## H2Opho.34        NA   NA
## H2Opho.35        NA   NA
## saaH2O           NA   NA
## faal.fold        NA   NA
## polar.req        NA   NA
## freq           0.14   NA

Find the max and the min values in the data frame.

The saaH2O(Surface area accessible to water in an unfolded peptide) and vol (volume from van der Waals radii) have the most positive correlation of 0.99. The pol.req (polar requirement) and H2Opho.35 (2nd Hydrophobicity scale) have the most negative correlation with -0.87.

which(cor_ == max(cor_, na.rm = T), arr.ind = T)

##        row col
## saaH2O   8   2

which(cor_ == min(cor_, na.rm = T), arr.ind = T)

##           row col
## polar.req  10   7

This function plots the correlation coefficients etween variables. In the scatterplot matrix below, upper.panel = panel.cor and the upper half of the matrix is just the correlation coefficients instead of the same graphs again.

panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor,...)
{
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(0, 1, 0, 1))
    r <- abs(cor(x, y))
    txt <- format(c(r, 0.123456789), digits = digits)[1]
    txt <- paste0(prefix, txt)
    if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
    text(0.5, 0.5, txt, cex = cex.cor * r)
}

Figure 0: Scatterplot Matrix

In addition to the strongest correlation written about earlier, Molecular weight in daltons (MW.da) and volume have a very storng correlation along with MW.da and saaH2O. Volume and the 1st Hydrophobicity scale(H2O.34) have a very weak correlation. Same with MW.da and polar.req, there is a very weak correlation. H2O.34 and pol have a very nonlinear relationship in addition to polar.req and faal.fold(Fraction of accessible area lost when a protein folds)

#make scatterplot matrix
plot(aa_dat[ ,c("MW.da", "vol", "pol", "H2Opho.34", "polar.req", "saaH2O", "faal.fold")],
     panel = panel.smooth,
      upper.panel = panel.cor )

Figure 1: Replication Higgs and Attwood Figure 2.8

Plot of the amino acid polar requirement versus the percentage of each amino acid in the protein sequences

#plot() creates blank plot
plot(polar.req  ~ freq, data = aa_dat, 
     xlab = "Polar Requirement", #adding x axis label
     ylab = "Frequency", #adding y axis label
     main = "Relative Frequency of Occurance vs Polar Requirement",
     col = 0)
#text() uses same tilda notation as plot() to plot the labels 
text(polar.req ~ freq, 
     labels = aa, 
     data = aa_dat, 
     col = 1:20)

Figure 2: HA Figure 2.8 with ggpubr

ggscatter(y = "polar.req",
x = "freq",
shape = "chemical",
color = "polar.req",
data = aa_dat,
xlab = "Frequency",
ylab = "Polar Requirement")

## Warning: The shape palette can deal with a maximum of 6 discrete values because
## more than 6 becomes difficult to discriminate; you have 7. Consider
## specifying shapes manually if you must have them.

## Warning: Removed 2 rows containing missing values (geom_point).

### Figure 3: Highly Correlated Data The general mathematical form of the regression line is y = mx + b. where y is the dependent variable and x is the explanatory variable. The slope of the line is b and m is the y intercept.

A data ellipse is similar to a line of regression in that it tries to situate the data and show a general trend. A data ellipse superimposes the normal-probability contours over a scatterplot of the data

ggscatter(y = "polar.req",
x = "freq",
size = "polar.req",
add = "reg.line", # line of best fit
ellipse = TRUE, # data ellipse
cor.coef = TRUE, # correlation coefficient
data = aa_dat,
xlab = "Frequency",
ylab = "Polar Requirement")

## `geom_smooth()` using formula 'y ~ x'

?ellipse

Figure 4: Non-Linear Relationship

ggscatter(y = "pol",
          x = "H2Opho.34",
          add  = "loess",
          ylab = "Pol",
          xlab = "1st Hydrophobicity scale",
          color = "freq",
          size = "faal.fold",
          data = aa_dat)

## `geom_smooth()` using formula 'y ~ x'

Figure 5: Non-Linear relationship on LOG scale

Taking the log transformation both of the variables can be used to re-scale variables or make the relationships linear

aa_dat$pol_log <- log(aa_dat$pol)
aa_dat$H2Opho.34_log <- log(aa_dat$H2Opho.34)

## Warning in log(aa_dat$H2Opho.34): NaNs produced

Plotting the same scatterplot as Figure 4 but with the log of the variables

ggscatter(y = "pol_log",
          x = "H2Opho.34_log",
          add  = "reg.line",
          ylab = "Logarithmic Pol",
          xlab = "Log of 1st Hydrophobicity scale",
          size = "faal.fold",
          color = "freq",
          data = aa_dat)

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 14 rows containing non-finite values (stat_smooth).

## Warning: Removed 13 rows containing missing values (geom_point).

Figure 6: Creating a 3D scatterplot from the aa_dat dataframe

All three of these variables have very high correlations with each other, all above 0.90, therefore I decided to plot all of them together.

scatterplot3d(x = aa_dat$MW.da,
                     y = aa_dat$saaH2O,
                     z = aa_dat$vol,
              highlight.3d = TRUE, 
              type = "h")

Principal Component Analysis

PCA is a dimensionality-reduction method that is used on larger data sets to make it easier to work with yet keeping most of the information. The number of variables are reduced to make the data easier to visualize and analyze. Higgs and Attwood describe it as follows, “Each row in the data matrix can be thought of as the coordinates of a point in P-dimensional space. The whole data set is a cloud of these points. The PCA method transforms this cloud of points first by scaling them and shifting them to the origin, and then by rotating them in such a way that the points are spread out as much as possible, and the structure in the data is made easier to see.” (Higgs and Attwood 2005, pg 26) In class we have used PCA for phylogenetic trees.

Figure 7: PCA - base R

I used column names to create a subset of the previous dataframe aa_dat.

aa_dat2 <- aa_dat[,c("vol", "bulk", "pol", "H2Opho.34", "isoelec", "H2Opho.35", "saaH2O", "faal.fold")]

Running a PCA and creating a PCA plot using base R command

pca.out <- prcomp(na.omit(aa_dat2), scale = TRUE)

Plotting the output

biplot(pca.out)

Figure 8: PCA vegan

Scale centers and/or scales the columns of a numeric matrix

rda.out <- vegan::rda(aa_dat2, scale = TRUE)

Extract PCA scores

rda_scores <- scores(rda.out)

There are some differences between this plot and the Higgs and Attwood plot on page 28. A main difference being that there is a very different sizes in the clusters in the plot below. There are three clusters, however one is significantly larger and the other two hold only two or three points. The largest cluster also spans all four quadranst while the other two smaller clusters stick to one quadrant.

Making a biplot, or PCA plot.

biplot(rda.out, display = "sites", col = 0,)
orditorp(rda.out, display = "sites", labels = aa_dat$aa, cex = 1.2)
ordihull(rda.out, 
         group = aa_dat$charge,
         col = 1:7,
         lty = 1:7,
         lwd = c(3,6))

Hierarchical Cluster Analysis

The purpose of cluster analysis is to see trends in the data and see which points of data are close to each other. The data is placed into groups, or clusters, for statistical data analysis.

UPGMA

UPGMA is a simple hierarchical clustering method. After a distance matrix has been created, find the two species with the least amount of differences. An average distance is caluclated and the two species become a cluster, or a unit. Then the process is continued with each cluster acting as one unit and being compared to other species to find a new cluster.

Creating row names

row.names(aa_dat2) <- paste(aa_dat$aa,
                            sep = ".")

A distance matrix is a matrix containing the distances between elements of a set. These distances are taken pairwise meaning the distance between each pair of the species. Euclidean distance is the straight line distance between two points on a plane.

Creating a distance matrix; calculating distance using euclidean method

dist_euc <- dist(aa_dat2,
                 method = "euclidean")

Using the hclust() function to perform cluster analysis

clust_euc <- hclust(dist_euc, method = "average")

Plotting the dendogram

There are a few differences between my graph below and the one in Higgs and Attwood 2005. One main one I am noticing is there are about 6 clusters in the paper and there are also about 6 in my graph, they are just structured differently.

dendro_euc <- as.dendrogram(clust_euc)
plot(dendro_euc, horiz = T, 
     nodePar = list(pch = c(1,NA),
                    cex = 0.5,
                    lab.cex = 1.2))

Replicating Higgs and Attwood’s analysis on the properties of amino acids

Michelle Prem, with contributions from Nathan Brouwer

11/12/2020