Introduction

This script replicates the Higgs and Attwood’s analysis on the properties of amino acids. With the data from their paper, we look at the properties of the amino acids, compare their similarities, and their specific functions. In this script, we mainly used PCA, clustering analysis and other R function.

Preliminaries

Packages

library(ggplot2)       # It is a flexible package for elegant data visualization in R. One word to describe the relationship between ggplot2 and ggpubr is wrapper. 
## Warning: package 'ggplot2' was built under R version 4.0.3
library(ggpubr)        # makes ggplot2 easier.The ‘ggpubr’ package provides some easy-to-use functions for creating and customizing ‘ggplot2’- based publication ready plots.
## Warning: package 'ggpubr' was built under R version 4.0.3
library(vegan)         # PCA has functions doing/plotting PCA
## Warning: package 'vegan' was built under R version 4.0.3
## Loading required package: permute
## Warning: package 'permute' was built under R version 4.0.3
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 4.0.3
## This is vegan 2.5-6
library(scatterplot3d) #make a 3d scatterplot
## Warning: package 'scatterplot3d' was built under R version 4.0.3
library(permute) #used for cluster analysis
library(lattice) #used for cluster analysis

Build the dataframe

These data are derived mostly from Higgs (2009) Table 1. They are similar to Higgs and Attwood 2005 but have extra columns.

Make the vectors.

# 1 letter code
aa        <-c('A','C','D','E','F','G','H','I','K','L','M','N',
              'P','Q','R','S','T','V','W','Y')


## molecular weight in dalton
MW.da     <-c(89,121,133,146,165,75,155,131,146,131,149,132,115,147,174,105,119,117,204,181)

## vol from van der Waals radii 
vol    <-c(67,86,91,109,135,48,118,124,135,124,124,96,90,
              114,148,73,93,105,163,141)

##  bulk – a measure of the shape of the side chain
bulk <-c(11.5,13.46,11.68,13.57,19.8,3.4,13.69,21.4,
              15.71,21.4,16.25,12.28,17.43,
              14.45,14.28,9.47,15.77,21.57,21.67,18.03)

## pol – a measure of the electric field strength around the molecule
pol  <-c(0,1.48,49.7,49.9,0.35,0,51.6,0.13,49.5,0.13,
              1.43,3.38,1.58,3.53,52,1.67,1.66,0.13,2.1,1.61)

## isoelec point
isoelec <-c(6,5.07,2.77,3.22,5.48,5.97,7.59,6.02,9.74,5.98,
                   5.74,5.41,6.3,5.65,10.76,5.68,6.16,5.96,5.89,5.66)

## 1st Hydrophobicity scale
H2Opho.34 <-c(1.8,2.5,-3.5,-3.5,2.8,-0.4,-3.2,4.5,-3.9,3.8,1.9,
                  -3.5,-1.6,-3.5,-4.5,-0.8,-0.7,4.2,-0.9,-1.3)

## 2nd Hydrophobicity scale
H2Opho.35 <-c(1.6,2,-9.2,-8.2,3.7,1,-3,3.1,-8.8,2.8,3.4,-4.8,
                  -0.2,-4.1,-12.3,0.6,1.2,2.6,1.9,-0.7)

## Surface area accessible to water in an unfolded peptide
saaH2O       <-c(113,140,151,183,218,85,194,182,211,180,204,158,
                 143,189,241,122,146,160,259,229)


## Fraction of accessible area lost when a protein folds
faal.fold    <-c(0.74,0.91,0.62,0.62,0.88,0.72,0.78,0.88,0.52,
                 0.85,0.85,0.63,0.64,0.62,0.64,0.66,0.7,0.86,0.85,0.76)



# Polar requirement 
polar.req    <-c(7,4.8,13,12.5,5,7.9,8.4,4.9,10.1,4.9,5.3,10,
                 6.6,8.6,9.1,7.5,6.6,5.6,5.2,5.4)

## relative frequency of occurance
## "The frequencies column shows the mean 
### percentage of each amino acid in the protein sequences 
### of modern organisms"
freq        <-c(7.8,1.1,5.19,6.72,4.39,6.77,2.03,6.95,6.32,
                10.15,2.28,4.37,4.26,3.45,5.23,6.46,5.12,7.01,1.09,3.3)

##
# charges
## un = Un-charged
## neg = negative
## pos = positive
charge<-c('un','un','neg','neg','un','un','pos','un','pos',
          'un','un','un','un','un','pos','un','un','un','un','un')

## hydropathy
hydropathy<-c('hydrophobic','hydrophobic','hydrophilic','
              hydrophilic','hydrophobic','neutral','neutral',
              'hydrophobic','hydrophilic','hydrophobic','hydrophobic',
              'hydrophilic','neutral',
              'hydrophilic','hydrophilic','neutral','neutral',
              'hydrophobic','hydrophobic','neutral')

## vol
vol.cat<-c('verysmall','small','small','medium',
              'verylarge','verysmall','medium','large','large',
              'large','large','small','small','medium','large',
              'verysmall',
              'small','medium','verylarge','verylarge')

## pol
pol.cat<-c('nonpolar','nonpolar','polar','polar',
                'nonpolar','nonpolar','polar','nonpolar',
                'polar','nonpolar','nonpolar','polar','nonpolar','polar',
                'polar','polar','polar','nonpolar','nonpolar','polar')

## chemical
chemical<-c('aliphatic','sulfur','acidic','acidic','aromatic',
            'aliphatic','basic','aliphatic','basic','aliphatic','sulfur',
            'amide','aliphatic','amide','basic','hydroxyl','hydroxyl',
            'aliphatic','aromatic','aromatic')

Build the dataframes.

### Build dataframe
aa_dat <- data.frame(aa, 
                     MW.da,vol,
                     bulk, pol, 
                     isoelec,H2Opho.34, H2Opho.35,
                     saaH2O, faal.fold, polar.req,
                     freq, charge, hydropathy,
                     vol.cat, pol.cat, chemical)
aa_dat
##    aa MW.da vol  bulk   pol isoelec H2Opho.34 H2Opho.35 saaH2O faal.fold
## 1   A    89  67 11.50  0.00    6.00       1.8       1.6    113      0.74
## 2   C   121  86 13.46  1.48    5.07       2.5       2.0    140      0.91
## 3   D   133  91 11.68 49.70    2.77      -3.5      -9.2    151      0.62
## 4   E   146 109 13.57 49.90    3.22      -3.5      -8.2    183      0.62
## 5   F   165 135 19.80  0.35    5.48       2.8       3.7    218      0.88
## 6   G    75  48  3.40  0.00    5.97      -0.4       1.0     85      0.72
## 7   H   155 118 13.69 51.60    7.59      -3.2      -3.0    194      0.78
## 8   I   131 124 21.40  0.13    6.02       4.5       3.1    182      0.88
## 9   K   146 135 15.71 49.50    9.74      -3.9      -8.8    211      0.52
## 10  L   131 124 21.40  0.13    5.98       3.8       2.8    180      0.85
## 11  M   149 124 16.25  1.43    5.74       1.9       3.4    204      0.85
## 12  N   132  96 12.28  3.38    5.41      -3.5      -4.8    158      0.63
## 13  P   115  90 17.43  1.58    6.30      -1.6      -0.2    143      0.64
## 14  Q   147 114 14.45  3.53    5.65      -3.5      -4.1    189      0.62
## 15  R   174 148 14.28 52.00   10.76      -4.5     -12.3    241      0.64
## 16  S   105  73  9.47  1.67    5.68      -0.8       0.6    122      0.66
## 17  T   119  93 15.77  1.66    6.16      -0.7       1.2    146      0.70
## 18  V   117 105 21.57  0.13    5.96       4.2       2.6    160      0.86
## 19  W   204 163 21.67  2.10    5.89      -0.9       1.9    259      0.85
## 20  Y   181 141 18.03  1.61    5.66      -1.3      -0.7    229      0.76
##    polar.req  freq charge                  hydropathy   vol.cat  pol.cat
## 1        7.0  7.80     un                 hydrophobic verysmall nonpolar
## 2        4.8  1.10     un                 hydrophobic     small nonpolar
## 3       13.0  5.19    neg                 hydrophilic     small    polar
## 4       12.5  6.72    neg \n              hydrophilic    medium    polar
## 5        5.0  4.39     un                 hydrophobic verylarge nonpolar
## 6        7.9  6.77     un                     neutral verysmall nonpolar
## 7        8.4  2.03    pos                     neutral    medium    polar
## 8        4.9  6.95     un                 hydrophobic     large nonpolar
## 9       10.1  6.32    pos                 hydrophilic     large    polar
## 10       4.9 10.15     un                 hydrophobic     large nonpolar
## 11       5.3  2.28     un                 hydrophobic     large nonpolar
## 12      10.0  4.37     un                 hydrophilic     small    polar
## 13       6.6  4.26     un                     neutral     small nonpolar
## 14       8.6  3.45     un                 hydrophilic    medium    polar
## 15       9.1  5.23    pos                 hydrophilic     large    polar
## 16       7.5  6.46     un                     neutral verysmall    polar
## 17       6.6  5.12     un                     neutral     small    polar
## 18       5.6  7.01     un                 hydrophobic    medium nonpolar
## 19       5.2  1.09     un                 hydrophobic verylarge nonpolar
## 20       5.4  3.30     un                     neutral verylarge    polar
##     chemical
## 1  aliphatic
## 2     sulfur
## 3     acidic
## 4     acidic
## 5   aromatic
## 6  aliphatic
## 7      basic
## 8  aliphatic
## 9      basic
## 10 aliphatic
## 11    sulfur
## 12     amide
## 13 aliphatic
## 14     amide
## 15     basic
## 16  hydroxyl
## 17  hydroxyl
## 18 aliphatic
## 19  aromatic
## 20  aromatic

Raw data exploration

Correlation Matrix

This code creates a correlation matrix. It uses the function to find out which variable has the most positive or negative correlation. saaH20 has the most postive correlation. Polar.req has the most negative correlation.

cor_ <- round(cor(aa_dat[,-c(1,13:17)]),2)
diag(cor_) <- NA
cor_[upper.tri(cor_)] <- NA
cor_
##           MW.da   vol  bulk   pol isoelec H2Opho.34 H2Opho.35 saaH2O faal.fold
## MW.da        NA    NA    NA    NA      NA        NA        NA     NA        NA
## vol        0.93    NA    NA    NA      NA        NA        NA     NA        NA
## bulk       0.55  0.73    NA    NA      NA        NA        NA     NA        NA
## pol        0.29  0.24 -0.20    NA      NA        NA        NA     NA        NA
## isoelec    0.20  0.36  0.08  0.27      NA        NA        NA     NA        NA
## H2Opho.34 -0.27 -0.08  0.44 -0.67   -0.20        NA        NA     NA        NA
## H2Opho.35 -0.25 -0.16  0.32 -0.85   -0.26      0.85        NA     NA        NA
## saaH2O     0.97  0.99  0.64  0.29    0.35     -0.18     -0.23     NA        NA
## faal.fold  0.11  0.18  0.49 -0.53   -0.18      0.84      0.79   0.12        NA
## polar.req -0.05 -0.19 -0.53  0.76   -0.11     -0.79     -0.87  -0.11     -0.81
## freq      -0.52 -0.30 -0.04 -0.01    0.02      0.26     -0.02  -0.38     -0.18
##           polar.req freq
## MW.da            NA   NA
## vol              NA   NA
## bulk             NA   NA
## pol              NA   NA
## isoelec          NA   NA
## H2Opho.34        NA   NA
## H2Opho.35        NA   NA
## saaH2O           NA   NA
## faal.fold        NA   NA
## polar.req        NA   NA
## freq           0.14   NA
which(cor_ == max(cor_, na.rm = T), arr.ind = T) #to find the most positive correlation
##        row col
## saaH2O   8   2
which(cor_ == min(cor_, na.rm = T), arr.ind = T) #to find the most negative correlation
##           row col
## polar.req  10   7

Plot 0: Scatterplot matrix

This code set up the function to plot the scatter matrix. And it also has the lines of best fit to allow us to see the relationship between the variables.

panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor,...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}
aa_dat_new <- data.frame(MW.da,vol,pol,H2Opho.34, polar.req,saaH2O, faal.fold)
plot(aa_dat_new, upper.panel = panel.cor,
     panel = panel.smooth)

From the graph, we can know that volume and saaH2O has the s

the relationships between several variables are strong: saaH2O/volume, saaH2O/MW.da,and so forth. the relationships between several variables are weak: saaH2O/pol, pol/Mw.da and so forth. the relationship between variables is non-linear: Pol/H2Opo.34.

Plot 1: Replication Higgs and Attwood Figure 2.8

This plot is the replicated copy of the Figure 2.8 from the Higgs and Atwood paper showing the relationship between amino acid volume and pI.

plot(polar.req ~ freq, data = aa_dat, #draw data from the aa_data to make the plot
     xlab = "pI",            #label the x axis
     ylab = "Volume(Å3)",        #label the y axis
     main = "Plot of amino acid volume against pI", #name the title for the plot
     col = 0)
text(polar.req ~ freq, #main plotting
     labels = aa, 
     data = aa_dat, 
     col = 1:20)

Figure 2: HA Figure 2.8 with ggpubr

This plot shows the relationship between frequency and polar requirement. Since the graph uses various sizes to demonstrate the bulkiness, and different color to denote molecular weight, we are able to see 4 variables in one plot.

ggscatter(y = "polar.req",
x = "freq",
size = "polar.req",
color = "polar.req",
data = aa_dat,
xlab = "Frequency",
ylab = "Polar Requirement")

Figure 3: Highly correlated data

Firstly, identify 2 variables that are highly positively correlated, which are volume and molecular weight. Secondly, plot them with ggpubr. Thirdly, add the regression line and data ellipse. The general form of the regression line is y=kx+b. k is the slope and b is the y-intercept. If we have some new data, it possibly falls into the area encompassed by the data ellipse.

ggscatter(y = "MW.da",
x = "vol",
size = "polar.req",
add = "reg.line",      # add a regression line 
ellipse = TRUE,       # add a data ellipse
cor.coef = TRUE,      # add a correlation coef
# color = "freq",
data = aa_dat,
xlab = "Volme ",
ylab = "Molecular Weight in Dalton")
## `geom_smooth()` using formula 'y ~ x'

Figure 4: Non-linear relationship

This plot shows the non-linear relationship between polarity and molecular weight.

ggscatter(y = "pol",
          x = "MW.da",
          size = "vol",
          add = "loess",               # add a loess smoother
          cor.coef = TRUE,             # correlation coef
          color = "faal.fold",
          data = aa_dat,
          xlab = "Molecular Weight in Dalton",
          ylab = " Polarity")
## `geom_smooth()` using formula 'y ~ x'

Figure5: Non-linear relationship with log-transformation

The log scale transform the plot to its linear form by plotting the original data’s after performing the log function. After the log-transformation, we can get a a linear regression line in spite of the non-linear relationship.

aa_dat$pol_log <- log(aa_dat[,"pol"])
aa_dat$MW.da_log <- log(aa_dat[,"MW.da"])


ggscatter(y = "pol_log",
          x = "MW.da_log",
          size = "vol",
          add = "reg.line", # line of best fit
          color = "faal.fold",
          data = aa_dat,
          xlab = "Molecular Weight in Dalton",
          ylab = "Polarity")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (stat_smooth).

Figure6:3D Scatterplot

The code below will create a 3D scatter plot that demonstrates the relationship between MW.da, volume, and saaH2O. These three variables have relatively strong correlations with one another. It means, as x(Mw.da) increases, y(volume) and z(saaH2O) also increase.

To interprete the correlation in a 3D scatterplot, we can look at each plane seperately. To find out the relationship between x(MW.da) and y(vol), we look at the bottom plane. To find out the relationship between x(MW.da) and z(saaH2O),we looking the plane where it forms a diagonal line.

scatterplot3d(x = aa_dat$MW.da,
              y = aa_dat$vol,
              z = aa_dat$saaH2O,
              xlab = "MW.da",
              ylab = "vol",
              zlab = "saaH2O",
              highlight.3d = TRUE,
              type = "h")

Principal component analysis

Higgs and Attwood define Principal component analysis (PCA) as “a way of visualizing the important features of multidimensional data sets, such as tables of amino acid properties”(2005, pg 34).PCA is a statistical technique used for reducing the dimmentionality of a large dataset. It is capable of recalculating coordinates for the data, then essentially flattening the data down. In biology or ecology, we usually use PCA for phylogenetic analysis.

Data Prep

A new dataframe, aa_dat2, is created with 8 variables. THose 8 variables are the ones in the Table 2.2.

aa_dat2 <- data.frame(vol,bulk,pol, isoelec, H2Opho.34, H2Opho.35, saaH2O,faal.fold)

Figure7: Principal components analysis- Base R

This code is for creating a PCA plot by a base R command and plotting the output.

pca.out <- prcomp(aa_dat2, scale = TRUE) #Run the PCA

biplot(pca.out) #Plot the output

Figure 8: Principal components analysis - Vegan

This code is for running the PCA using vegan and call the output rda.out.

rda.out <- vegan::rda(aa_dat2,
scale = TRUE)       # to properly scale numeric data by using their respective standard deviation
rda_scores <- scores(rda.out)   #extract the PCA score and name it as rda_scores
rda_scores
## $species
##                  PC1         PC2
## vol        0.1292455 -1.21867402
## bulk      -0.5240422 -1.00021659
## pol        1.0392368 -0.22161142
## isoelec    0.4272346 -0.51780748
## H2Opho.34 -1.1576058 -0.06083202
## H2Opho.35 -1.1909235  0.07563303
## saaH2O     0.2293130 -1.17930581
## faal.fold -1.0664891 -0.35159129
## 
## $sites
##                PC1         PC2
## sit1  -0.431697708  1.01348876
## sit2  -0.812161280  0.50497652
## sit3   0.958626669  0.71357979
## sit4   0.941582653  0.22249037
## sit5  -0.867271594 -0.76051369
## sit6  -0.093468837  1.79557675
## sit7   0.643286377 -0.33344292
## sit8  -1.013877712 -0.56001005
## sit9   1.450808627 -0.67736662
## sit10 -0.904499352 -0.52173690
## sit11 -0.668686375 -0.38778285
## sit12  0.462264010  0.53066191
## sit13  0.007416261  0.36866189
## sit14  0.458970843  0.05206945
## sit15  1.580353509 -1.08451465
## sit16 -0.014704047  1.08414678
## sit17 -0.184996200  0.37158066
## sit18 -0.979015329 -0.23223164
## sit19 -0.420924604 -1.37484059
## sit20 -0.112005910 -0.72479295
## 
## attr(,"const")
## [1] 3.511243

This code is for creating a biplot.

biplot(rda.out, display = "sites", col = 0,)
orditorp(rda.out, display = "sites", labels = aa_dat$aa, cex = 1.2)
ordihull(rda.out,
group = aa_dat$charge,
col = 1:7,
lty = 1:7,
lwd = c(3,6))

The plot created above is very different from the one in the in the Higgs and Atwood paper. 1. the scale for these two plots are different 2. the data points in the paper are more spread out comparing to the plot created above. 3. the orientation of the figure in the paper is the upside-down version of the plot created above. 4. in the original paper, there is no grouping indicated in the figure. But the plot created above make use of red, green and black lines to show the grouping to show the relationship.

Cluster analysis

The purpose of cluster analysis is classify different objects into similar groups.To categorize them into cluster, we base on the similarities or dissimilarities between the objects.

Unweighted pair-group method with arithmetic mean (UPGMA) is a type of clustering method that utilizes the hierarchical clustering method. Since UPGMA is a distance method, it needs a distance matrix. And it can be used for generating visual plots such as dendrograms.

UPGMA-Hierarchical Cluster Analysis

row.names(aa_dat2) <- paste(aa_dat$aa, sep = ".")
dist_euc <- dist(aa_dat2, 
                 method = "euclidean")

clust_euc<- hclust(dist_euc,
                    method = "average") #this method sets hclust to UPGMA

dendro_euc = as.dendrogram(clust_euc)
plot(dendro_euc,horiz = T,
     nodePar = list(pch = c(1,NA),
                    cex = 0.5,
                    lab.cex = 1.2))

This plot is a distance matrix to show distance between objects: amino acids. The distance is calculated by Euclidean distance matrix based on the different properties.An Euclidean distance matrix is a matrix containing squared distances between two points. As an equation, it can be expressed as a sum of squares.

This cluster analysis has some differences in the comparison to the one in the Higgs and Atwood paper.The difference is the relative proximity between amino acids. The distance between some amino acids gets changed: some are closer and some are further.