INTRODUCTION:

This script analyzes data on amino acids and attempts to replicate work done by Higgs and Attwood. This script analyzes correlations and similarities between amino acids to see how they function together in protein folding. Analysis of amino acids using bioinformatics is important since protein folding is a vital piece in the functioning of our bodies, and the use of bioinformatics in particular allows us to see relationships between amino acids and their properties. Additionally, we use this amino acid data to learn and understand key concepts of bioinformatics in R, these being Principal Component Analysis and various types of Clustering algorithms, in particular UPGMA. It is important to try and replicate work done by Higgs and Attwood to not only understand their work and code better, but to also in a sense check their work and to see if their data analysis is correct and can contribute to the world scientifically.

PRELIMINARIES:

Packages

#A very powerful R package that is widely used. Generally used for data creation, managing, etc,. Very complex.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.3
#A "wrapper" for the ggplot2 package. Allows for the easier use of ggplot2, essentially simplyfing ggplot2.
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.0.3
#Used to make 3D scatterplots.
library(scatterplot3d)
## Warning: package 'scatterplot3d' was built under R version 4.0.3
#A package that is generally used for PCA (through the rda() function) and other features due to the rigidness of base R PCA.
library(vegan)
## Warning: package 'vegan' was built under R version 4.0.3
## Loading required package: permute
## Warning: package 'permute' was built under R version 4.0.3
## Loading required package: lattice
## This is vegan 2.5-6
#ggplot2 is an extremely powerful tool for plotting, and as such it is one of the cornerstones of Computational Biology work in R. However, its power also makes it very complex, which is where ggpubr comes in. ggpubr creates "wrappers", which are essentially functions that run other functions, for many of ggplot2's core functionality.

Building the Dataframe

This data is derived mostly from Higgs (2009) Table 1 regarding amino acids. They are similar to Higgs and Attwood 2005 but have extra columns. Nathan Brouwer added additional categorical varibles based on online information.

# 1 letter code
aa  <-c('A','C','D','E','F','G','H','I','K','L','M','N','P','Q','R','S','T','V','W','Y')

# molecular weight in dalton
MW.da     <-c(89,121,133,146,165,75,155,131,146,131,149,132,115,147,174,105,119,117,204,181)

# vol from van der Waals radii
volume    <-c(67,86,91,109,135,48,118,124,135,124,124,96,90,
              114,148,73,93,105,163,141)

# bulk – a measure of the shape of the side chain
bulkiness <-c(11.5,13.46,11.68,13.57,19.8,3.4,13.69,21.4,
              15.71,21.4,16.25,12.28,17.43,
              14.45,14.28,9.47,15.77,21.57,21.67,18.03)

## pol – a measure of the electric field strength around the molecule
polarity  <-c(0,1.48,49.7,49.9,0.35,0,51.6,0.13,49.5,0.13,
              1.43,3.38,1.58,3.53,52,1.67,1.66,0.13,2.1,1.61)

# isoelec point
isoelectric.pt <-c(6,5.07,2.77,3.22,5.48,5.97,7.59,6.02,9.74,5.98,
                   5.74,5.41,6.3,5.65,10.76,5.68,6.16,5.96,5.89,5.66)

# 1st Hydrophobicity scale
hydrophobe.34 <-c(1.8,2.5,-3.5,-3.5,2.8,-0.4,-3.2,4.5,-3.9,3.8,1.9,
                  -3.5,-1.6,-3.5,-4.5,-0.8,-0.7,4.2,-0.9,-1.3)

# 2nd Hydrophobicity scale
hydrophobe.35 <-c(1.6,2,-9.2,-8.2,3.7,1,-3,3.1,-8.8,2.8,3.4,-4.8,
                  -0.2,-4.1,-12.3,0.6,1.2,2.6,1.9,-0.7)

# Surface area accessible to water in an unfolded peptide
saaH2O       <-c(113,140,151,183,218,85,194,182,211,180,204,158,
                 143,189,241,122,146,160,259,229)

# Fraction of accessible area lost when a protein folds
faal.fold    <-c(0.74,0.91,0.62,0.62,0.88,0.72,0.78,0.88,0.52,
                 0.85,0.85,0.63,0.64,0.62,0.64,0.66,0.7,0.86,0.85,0.76)

# Polar requirement
polar.req    <-c(7,4.8,13,12.5,5,7.9,8.4,4.9,10.1,4.9,5.3,10,
                 6.6,8.6,9.1,7.5,6.6,5.6,5.2,5.4)

## relative frequency of occurrance
## "The frequencies column shows the mean
### percentage of each amino acid in the protein sequences
### of modern organisms"
freq        <-c(7.8,1.1,5.19,6.72,4.39,6.77,2.03,6.95,6.32,
                10.15,2.28,4.37,4.26,3.45,5.23,6.46,5.12,7.01,1.09,3.3)


# charges
## un = Un-charged
## neg = negative
## pos = positive
charge<-c('un','un','neg','neg','un','un','pos','un','pos',
          'un','un','un','un','un','pos','un','un','un','un','un')

# hydropathy
hydropathy<-c('hydrophobic','hydrophobic','hydrophilic','
              hydrophilic','hydrophobic','neutral','neutral','hydrophobic','hydrophilic','hydrophobic','hydrophobic','hydrophilic','  neutral',
              'hydrophilic','hydrophilic','neutral','neutral',
              'hydrophobic','hydrophobic','neutral')

# vol
volume.cat<-c('verysmall','small','small','medium',
              'verylarge','verysmall','medium','large','large',
              'large','large','small','small','medium','large','verysmall','small','medium','verylarge','verylarge')

# pol
polarity.cat<-c('nonpolar','nonpolar','polar','polar',
                'nonpolar','nonpolar','polar','nonpolar',
                'polar','nonpolar','nonpolar','polar','nonpolar','polar',
                'polar','polar','polar','nonpolar','nonpolar','polar')

# chemical
chemical<-c('aliphatic','sulfur','acidic','acidic','aromatic',
            'aliphatic','basic','aliphatic','basic','aliphatic','sulfur',
            'amide','aliphatic','amide','basic','hydroxyl','hydroxyl',
            'aliphatic','aromatic','aromatic')

Building the dataframe:

aa_dat <- data.frame(aa,MW.da,volume,bulkiness,
polarity,isoelectric.pt,hydrophobe.34,hydrophobe.35,
saaH2O,faal.fold, polar.req,freq,charge,hydropathy,volume.cat,polarity.cat,chemical)
aa_dat
##    aa MW.da volume bulkiness polarity isoelectric.pt hydrophobe.34
## 1   A    89     67     11.50     0.00           6.00           1.8
## 2   C   121     86     13.46     1.48           5.07           2.5
## 3   D   133     91     11.68    49.70           2.77          -3.5
## 4   E   146    109     13.57    49.90           3.22          -3.5
## 5   F   165    135     19.80     0.35           5.48           2.8
## 6   G    75     48      3.40     0.00           5.97          -0.4
## 7   H   155    118     13.69    51.60           7.59          -3.2
## 8   I   131    124     21.40     0.13           6.02           4.5
## 9   K   146    135     15.71    49.50           9.74          -3.9
## 10  L   131    124     21.40     0.13           5.98           3.8
## 11  M   149    124     16.25     1.43           5.74           1.9
## 12  N   132     96     12.28     3.38           5.41          -3.5
## 13  P   115     90     17.43     1.58           6.30          -1.6
## 14  Q   147    114     14.45     3.53           5.65          -3.5
## 15  R   174    148     14.28    52.00          10.76          -4.5
## 16  S   105     73      9.47     1.67           5.68          -0.8
## 17  T   119     93     15.77     1.66           6.16          -0.7
## 18  V   117    105     21.57     0.13           5.96           4.2
## 19  W   204    163     21.67     2.10           5.89          -0.9
## 20  Y   181    141     18.03     1.61           5.66          -1.3
##    hydrophobe.35 saaH2O faal.fold polar.req  freq charge
## 1            1.6    113      0.74       7.0  7.80     un
## 2            2.0    140      0.91       4.8  1.10     un
## 3           -9.2    151      0.62      13.0  5.19    neg
## 4           -8.2    183      0.62      12.5  6.72    neg
## 5            3.7    218      0.88       5.0  4.39     un
## 6            1.0     85      0.72       7.9  6.77     un
## 7           -3.0    194      0.78       8.4  2.03    pos
## 8            3.1    182      0.88       4.9  6.95     un
## 9           -8.8    211      0.52      10.1  6.32    pos
## 10           2.8    180      0.85       4.9 10.15     un
## 11           3.4    204      0.85       5.3  2.28     un
## 12          -4.8    158      0.63      10.0  4.37     un
## 13          -0.2    143      0.64       6.6  4.26     un
## 14          -4.1    189      0.62       8.6  3.45     un
## 15         -12.3    241      0.64       9.1  5.23    pos
## 16           0.6    122      0.66       7.5  6.46     un
## 17           1.2    146      0.70       6.6  5.12     un
## 18           2.6    160      0.86       5.6  7.01     un
## 19           1.9    259      0.85       5.2  1.09     un
## 20          -0.7    229      0.76       5.4  3.30     un
##                     hydropathy volume.cat polarity.cat  chemical
## 1                  hydrophobic  verysmall     nonpolar aliphatic
## 2                  hydrophobic      small     nonpolar    sulfur
## 3                  hydrophilic      small        polar    acidic
## 4  \n              hydrophilic     medium        polar    acidic
## 5                  hydrophobic  verylarge     nonpolar  aromatic
## 6                      neutral  verysmall     nonpolar aliphatic
## 7                      neutral     medium        polar     basic
## 8                  hydrophobic      large     nonpolar aliphatic
## 9                  hydrophilic      large        polar     basic
## 10                 hydrophobic      large     nonpolar aliphatic
## 11                 hydrophobic      large     nonpolar    sulfur
## 12                 hydrophilic      small        polar     amide
## 13                     neutral      small     nonpolar aliphatic
## 14                 hydrophilic     medium        polar     amide
## 15                 hydrophilic      large        polar     basic
## 16                     neutral  verysmall        polar  hydroxyl
## 17                     neutral      small        polar  hydroxyl
## 18                 hydrophobic     medium     nonpolar aliphatic
## 19                 hydrophobic  verylarge     nonpolar  aromatic
## 20                     neutral  verylarge        polar  aromatic

Raw Data Exploration

Correlation Matrix

Creating a correlation matrix

cor_ <- round(cor(aa_dat[,-c(1,13:17)]),2)
diag(cor_) <- NA
cor_[upper.tri(cor_)] <- NA
cor_
##                MW.da volume bulkiness polarity isoelectric.pt hydrophobe.34
## MW.da             NA     NA        NA       NA             NA            NA
## volume          0.93     NA        NA       NA             NA            NA
## bulkiness       0.55   0.73        NA       NA             NA            NA
## polarity        0.29   0.24     -0.20       NA             NA            NA
## isoelectric.pt  0.20   0.36      0.08     0.27             NA            NA
## hydrophobe.34  -0.27  -0.08      0.44    -0.67          -0.20            NA
## hydrophobe.35  -0.25  -0.16      0.32    -0.85          -0.26          0.85
## saaH2O          0.97   0.99      0.64     0.29           0.35         -0.18
## faal.fold       0.11   0.18      0.49    -0.53          -0.18          0.84
## polar.req      -0.05  -0.19     -0.53     0.76          -0.11         -0.79
## freq           -0.52  -0.30     -0.04    -0.01           0.02          0.26
##                hydrophobe.35 saaH2O faal.fold polar.req freq
## MW.da                     NA     NA        NA        NA   NA
## volume                    NA     NA        NA        NA   NA
## bulkiness                 NA     NA        NA        NA   NA
## polarity                  NA     NA        NA        NA   NA
## isoelectric.pt            NA     NA        NA        NA   NA
## hydrophobe.34             NA     NA        NA        NA   NA
## hydrophobe.35             NA     NA        NA        NA   NA
## saaH2O                 -0.23     NA        NA        NA   NA
## faal.fold               0.79   0.12        NA        NA   NA
## polar.req              -0.87  -0.11     -0.81        NA   NA
## freq                   -0.02  -0.38     -0.18      0.14   NA

The code below uses the which() function to find the maximum correlation value in the data frame and returns its location, while removing any NA values in the calculations. As we see, saaH2O and volume have the greatest correlation, while polar.req and hydrophobe.35 have the smallest correlation.

which(cor_ == max(cor_, na.rm = T), arr.ind = T)
##        row col
## saaH2O   8   2
which(cor_ == min(cor_, na.rm = T), arr.ind = T)
##           row col
## polar.req  10   7

Figure 0: Scatterplot Matrix

The function below is used to create lines of best fit in each of our correlation plots so that we can see linear relationships, positive correlations, negative correlations, etc within our scatterplot correlation matrix.

panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor,...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}

The Correlation scatterplot matrix:

plot(~MW.da+volume+polarity+hydrophobe.34+polar.req+bulkiness,
     data = na.omit(aa_dat[,c("MW.da","volume","polarity","hydrophobe.34","polar.req","bulkiness")]),
     panel = panel.smooth,
     upper.panel =panel.cor )

With this data, we see a plethora of correlation relationships. Much of the data features non-linear relationships, such as Polarity and MW.data, or Bulkiness and Hydrophobe.34. Correlations that are greater than or equal to 0.7 have strong positive correlations, while correlations that are less than or equal to -0.7 have strong negative correlations.

Figure 1: Replication of Higgs and Attwood’s Figure 2.8

plot(volume  ~ isoelectric.pt, data = aa_dat,
     xlab = "pI",
     ylab = "Volume",
     main = "Protein Folding - Volume vs pI",
     col = 0)
text(volume ~ isoelectric.pt, 
     labels = aa, 
     data = aa_dat, 
     col = 1:20)

#Plot() is used to plot the look of the graph, such as the x and y labels, while text() is used to plot the actual data onto the graph made by plot()

This basic r plot shows amino acid volume against pI – two properties thought to be important in protein folding. The plot shows the acidic amino acids at low pI, the basic amino acids at high pI, and all the rest in the middle, which understandably corresponds to their acidic, basic, or neutral properties.

Figure 2: HA Figure 2.8 with ggpubr

We will now use ggpubr and, by proxy, ggplot2 to plot two numeric values of the amino acids.In this case, we will use polar.req and frequency to replicate Figure 2.8.

ggscatter(y = "polar.req",
x = "freq",
size = "polar.req",
color = "polar.req",
data = aa_dat,
xlab = "freq",
ylab = "polar.req")

Note: I wanted to use different numerical data, but they did not properly replicate the data, so I will be using the two provided in the PDF.

Figure 3: Highly Correlated Data

Using ggpubr, we will be plotting 2 variables that have highly positive correlations.The variables we will use in this case will MW.da and volume, which have a correlation of 0.93.

ggscatter(y = "volume",
          x = "MW.da",
          add = "reg.line", #Line of best fit
          ellipse = TRUE, #Data ellipse
          cor.coef = TRUE, #Correlation coefficient
          data = aa_dat,
          xlab = "MW.da",
          ylab = "Volume")
## `geom_smooth()` using formula 'y ~ x'

The general form of the line of best fit, or regression line, in statistics is N = B0 + B1T. A data ellipse is an ellipse used to represent the standard deviation of the data.

Figure 4: Non-linear relationship

Now we will examine a non-linear relationship, in particular volume and hydrophobe.34. Using ggpubr, we will use a loess smoother to smooth the non-linear line and see a trend in the data.

ggscatter(y = "volume",
          x = "hydrophobe.34",
          add  = "loess",
          color = "polarity",
          size = "polar.req",
          data = aa_dat)
## `geom_smooth()` using formula 'y ~ x'

Figure 5: Non-linear relationship on LOG scale

We see that the data in Figure 4 is still not a linear relationship although it is smoother. We instead use the natural log to re-scale the data and make it linear. First we need to create new values in the data frame that contain the log versions of volume and hydrophobe.34.

aa_dat$volume_log <- log(aa_dat[,"volume"])
aa_dat$hydrophobe.34_log <- log(aa_dat[,"hydrophobe.34"])
## Warning in log(aa_dat[, "hydrophobe.34"]): NaNs produced

Now the log-scale scatter plot:

ggscatter(y = "volume_log",
          x = "hydrophobe.34_log",
          add = "reg.line",
          color = "polarity",
          size = "polar.req",
          data = aa_dat)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 13 rows containing non-finite values (stat_smooth).
## Warning: Removed 13 rows containing missing values (geom_point).

The log transformation makes the non-linear relationship linear, and removes NA data that is created after the data becomes a natural log.

Figure 6: 3D Scatterplot

Now, we will look at the correlations of 3 variables using a third-dimensional scatter plot. We will use MW.da, volume and saaH2O, which have relatively strong correlations with each other.

scatterplot3d(x = aa_dat$MW.da,
                     y = aa_dat$volume,
                     z = aa_dat$saaH2Og,
              xlab = "aa_dat$MW.da",
              highlight.3d = TRUE,
              type = "h")

These three variables are all relatively strongly correlated with each other. This 3D scatter plot visualizes the relationships between all the variables on a three dimensional plane, and we can see that the values are relatively strongly correlated with each other.. The Z axis is visualized better with type = “h”, which drops lines from the x and y axis.

Principal Component Analysis

Principal Component Analysis, or PCA for short, is a very useful analysis tool that is used a lot in computational biology. Principal Component Analysis is essentially a dimension reduction tool that turns multi-dimensional data frames into a two dimensional plot that shows the greatest amount of variation in the data. An important part to know about PCA is that it is not aware of groupings, and that groupings must be done by hand. To quote Higgs and Attwood, “Principal component analysis (PCA) is a way of visualizing the important features of multidimensional data sets, such as tables of amino acid properties”(Higgs and Attwood 2005, pg 34-35).

In R, there are two primary ways to execute principal component analysis. The first is using base R, which is a rather simple method but is also somewhat inflexible and so not necessarily ideal for use. The second method alleviates the issues with base R, through the use of the package vegan and its specific function rda(). Vegan in general has many useful tools for grouping that base R does not.

Data Prep

Before creating PCA’s in base R and with vegan, we must first create a new data set that contains only numeric data. We will simply create a new data frame of just numeric values from the original data frame.

aa_dat2 <- aa_dat[,c("volume","bulkiness","polarity","isoelectric.pt","hydrophobe.34","hydrophobe.35","saaH2O","faal.fold")]

While there is 11 variables given to us, we will only be using the 8 in Higgs and Attwood’s table 2.2 to replicate it properly.

Figure 7: PCA using Base R

We will first demonstrate a PCA using base R components.

pca.out <- prcomp(aa_dat2[,-c(7,8)], scale = TRUE)

Now that we have the PCA created, we will now plot it:

biplot(pca.out)

Figure 8: PCA using the Vegan package

Since base PCA does not help with grouping, it is rather difficult to tell between the different groups. Thus, we use vegan to plot and group PCA. For another example see https://rpubs.com/brouwern/veganpca.

rda.out <- vegan::rda(aa_dat2[,-c(7,8)], scale = TRUE)
#Scale = True is important for correctly scaling numeric data by their standard deviations.

Now we must extract scores from the PCA:

rda_scores <- scores(rda.out)

Now the biplot of the PCA:

biplot(rda.out, display = "sites", col = 0,)
orditorp(rda.out, display = "sites", labels = aa_dat$aa, cex = 1.2)
ordihull(rda.out,
group = aa_dat$charge,
col = 1:7,
lty = 1:7,
lwd = c(3,6))

The graph created by our vegan PCA varies greatly from the Higgs and Attwood’s Figure 2.10 PCA. While mostly similar, it varies mostly in orientation and scaling. The orientation of the data is in the reverse of Figure 2.10. The scaling of the x and y axes are different as well. Most importantly is the fact that there are no groupings in Figure 2.10, while our vegan PCA allows us to see a large number of different groupings.

Cluster Analysis

To finish our analysis of our amino acid data set, we will use cluster analysis. Cluster analysis in biology often involves building tree diagrams (dendrograms), with the purpose being to group data into like groups so that we can clearly analyze similarities in our data and how they relate to help us understand functions, classifications, etc,.

One type of clustering analysis we will use here will be UPGMA, or unweighted pair group method with arithmetic mean. UPGMA is a hierarchical clustering analysis that uses distance-based algorithms to create a dendrogram of the data. Using a matrix of pairwise distances, UPGMA finds the smallest distance between two values in the distance matrix and groups them together, and proceeds to do this same algorithm for all values in the data until no more groups can be made. After the algorithm is complete, the groups can be made into a simple hierarchical dendrogram.

Hierarchical Cluster Analysis

UPGMA is a hierarchical cluster analysis, and we will be using the data set aa_dat2 to perform UPGMA. First we will make row names for the UPGMA.

row.names(aa_dat2) <- paste(aa_dat$aa,1:nrow(aa_dat2),sep = ".")
row.names(aa_dat2)
##  [1] "A.1"  "C.2"  "D.3"  "E.4"  "F.5"  "G.6"  "H.7"  "I.8"  "K.9"  "L.10"
## [11] "M.11" "N.12" "P.13" "Q.14" "R.15" "S.16" "T.17" "V.18" "W.19" "Y.20"

Now that we have row names, we will create a distance matrix using the euclidean algorithm.

dist_euc <- dist(aa_dat2[,-9], 
                 method = "euclidean")

A distance matrix is a two dimensional array or matrix that consists of the pairwise distances between to values in the data frame. The distance are generally found mathematically or can be seen like in the PCA plot above. The distance in PCA plots are generally euclidean distances. Euclidean distances are the length distance between two points on a plane or plot. It can be essentially thought of as the hypotenuse of a triangle.

Now the cluster analysis.

clust_euc <- hclust(dist_euc, method = "average")
#setting the method to "average" makes the hclust() function perform a UPGMA cluster analysis. Hclust() defaults to a complete cluster analysis.

Dendrogram plot of the cluster analysis.

par(mar = c(1,1,1,1))
dendro_euc <- as.dendrogram(clust_euc)
plot(dendro_euc,horiz = T,nodePar = list(pch = c(1,NA), 
                                          cex = 0.5, 
                                          lab.cex = 0.5))

Our dendrogram differs greatly from Higgs and Attwoods Plate 2.2 hierarchical cluster analysis in terms of topography. Higgs and Attwood included a heat map in their dendrogram, which helps relay more information about each cluster. Additionally, the clustering of groups varies greatly from Plate 2.2. This may be due to their use of CLUTO for their cluster analysis, which is a different package in R, while we simply used base R clustering algorithms.