Abstract

In this script, I follow the steps of certain data exploration tools in order to better understand the relationships between the 20 amino acids by studying their residues. Using data from the paper written by Higgs and Attwood, I use many functions in base r and other packages that allow me to replicate some of the analysis done in the paper. We see that indeed we can use specific data methods to accurately describe the similarities and differences between amino acid residues.

Loading necessary packages

Both ggpubr and ggplot2 are good packages that visualize data. ‘ggpubr’ is a package that creates wrappers for most of ggplot2’s core functionality and you must explicitly define the variables in the function with quotation marks. Here we load such packages.

library(permute) #generates permutations 
library(lattice) #visualization system for data
library(vegan) #useful functions for ecological studies
## This is vegan 2.5-6
library(scatterplot3d) #for nice scatterplots 
library(ggpubr) #wrapper of ggplot2
## Loading required package: ggplot2
library(ggplot2) #creates plots

Set-up variables and dataframe

The data below represents many different properties of the amino acid residues. Both the categorical and numerical variables defined below will be used to help us understand any correlations that they may share.

# 1 letter code
aa <-c('A','C','D','E','F','G','H','I','K','L','M','N', 'P','Q','R','S','T','V','W','Y')

## molecular weight in dalton
MW.da <-c(89,121,133,146,165,75,155,131,146,131,149,132,115,147,174,105,119,117,204,181)

## vol from van der Waals radii
vol <-c(67,86,91,109,135,48,118,124,135,124,124,96,90, 114,148,73,93,105,163,141)

## bulk – a measure of the shape of the side chain
bulk <-c(11.5,13.46,11.68,13.57,19.8,3.4,13.69,21.4, 15.71,21.4,16.25,12.28,17.43,
14.45,14.28,9.47,15.77,21.57,21.67,18.03)

## pol – a measure of the electric field strength around the molecule
pol <-c(0,1.48,49.7,49.9,0.35,0,51.6,0.13,49.5,0.13, 1.43,3.38,1.58,3.53,52,1.67,1.66,0.13,2.1,1.61)

## isoelec point
isoelec <-c(6,5.07,2.77,3.22,5.48,5.97,7.59,6.02,9.74,5.98, 5.74,5.41,6.3,5.65,10.76,5.68,6.16,5.96,5.89,5.66)

## 1st Hydrophobicity scale
H2Opho.34 <-c(1.8,2.5,-3.5,-3.5,2.8,-0.4,-3.2,4.5,-3.9,3.8,1.9, -3.5,-1.6,-3.5,-4.5,-0.8,-0.7,4.2,-0.9,-1.3)

## 2nd Hydrophobicity scale
H2Opho.35 <-c(1.6,2,-9.2,-8.2,3.7,1,-3,3.1,-8.8,2.8,3.4,-4.8, -0.2,-4.1,-12.3,0.6,1.2,2.6,1.9,-0.7)

## Surface area accessible to water in an unfolded peptide
saaH2O <-c(113,140,151,183,218,85,194,182,211,180,204,158, 143,189,241,122,146,160,259,229)

## Fraction of accessible area lost when a protein folds
faal.fold <-c(0.74,0.91,0.62,0.62,0.88,0.72,0.78,0.88,0.52, 0.85,0.85,0.63,0.64,0.62,0.64,0.66,0.7,0.86,0.85,0.76)

# Polar requirement
polar.req <-c(7,4.8,13,12.5,5,7.9,8.4,4.9,10.1,4.9,5.3,10, 6.6,8.6,9.1,7.5,6.6,5.6,5.2,5.4)

## relative frequency of occurrence
## "The frequencies column shows the mean
### percentage of each amino acid in the protein sequences ### of modern organisms"
freq <-c(7.8,1.1,5.19,6.72,4.39,6.77,2.03,6.95,6.32,
10.15,2.28,4.37,4.26,3.45,5.23,6.46,5.12,7.01,1.09,3.3)

##
# charges
## un = Un-charged
## neg = negative
## pos = positive 
charge <-c('un','un','neg','neg','un','un','pos','un','pos','un','un','un','un','un','pos','un','un','un','un','un')

## hydropathy
hydropathy<-c('hydrophobic','hydrophobic','hydrophilic',' 
              hydrophilic','hydrophobic','neutral','neutral',
              'hydrophobic','hydrophilic','hydrophobic','hydrophobic',
              'hydrophilic','neutral',
              'hydrophilic','hydrophilic','neutral','neutral',
              'hydrophobic','hydrophobic','neutral')

## vol
vol.cat<-c('verysmall','small','small','medium', 
           'verylarge','verysmall','medium','large','large',
              'large','large','small','small','medium','large',
              'verysmall',
              'small','medium','verylarge','verylarge')

## pol
pol.cat<-c('nonpolar','nonpolar','polar','polar','nonpolar','nonpolar','polar','nonpolar','polar','nonpolar','nonpolar','polar','nonpolar','polar','polar','polar','polar','nonpolar','nonpolar','polar')

## chemical
chemical<-c('aliphatic','sulfur','acidic','acidic','aromatic', 
            'aliphatic','basic','aliphatic','basic','aliphatic','sulfur', 
            'amide','aliphatic','amide','basic','hydroxyl','hydroxyl', 
            'aliphatic','aromatic','aromatic')

Using the defined variables from both the paper and Dr. Brouwer, we create a dataframe and call it “aa_dat”.

aa_dat <- data.frame(aa,MW.da,vol,bulk,pol,isoelec,H2Opho.34,H2Opho.35,saaH2O,faal.fold,polar.req,freq,charge,hydropathy,vol.cat,pol.cat,chemical)

Data Exploration

Now that we have some things set-up, we will begin to further explore the data. First, we start by producing a similarity matrix.

Similarity matrix

This will produce the matrix with all the proportion identical values.

cor_ <- round(cor(aa_dat[,-c(1,13:17)]),2) 
diag(cor_) <- NA
cor_[upper.tri(cor_)] <- NA
cor_
##           MW.da   vol  bulk   pol isoelec H2Opho.34 H2Opho.35 saaH2O faal.fold
## MW.da        NA    NA    NA    NA      NA        NA        NA     NA        NA
## vol        0.93    NA    NA    NA      NA        NA        NA     NA        NA
## bulk       0.55  0.73    NA    NA      NA        NA        NA     NA        NA
## pol        0.29  0.24 -0.20    NA      NA        NA        NA     NA        NA
## isoelec    0.20  0.36  0.08  0.27      NA        NA        NA     NA        NA
## H2Opho.34 -0.27 -0.08  0.44 -0.67   -0.20        NA        NA     NA        NA
## H2Opho.35 -0.25 -0.16  0.32 -0.85   -0.26      0.85        NA     NA        NA
## saaH2O     0.97  0.99  0.64  0.29    0.35     -0.18     -0.23     NA        NA
## faal.fold  0.11  0.18  0.49 -0.53   -0.18      0.84      0.79   0.12        NA
## polar.req -0.05 -0.19 -0.53  0.76   -0.11     -0.79     -0.87  -0.11     -0.81
## freq      -0.52 -0.30 -0.04 -0.01    0.02      0.26     -0.02  -0.38     -0.18
##           polar.req freq
## MW.da            NA   NA
## vol              NA   NA
## bulk             NA   NA
## pol              NA   NA
## isoelec          NA   NA
## H2Opho.34        NA   NA
## H2Opho.35        NA   NA
## saaH2O           NA   NA
## faal.fold        NA   NA
## polar.req        NA   NA
## freq           0.14   NA

The code below will show which variable has the max and min correlation and where at. We see that saaH2O and volume has the strongest positive correlation (0.99) and polar.req with H2Opho.35 has the most negative (-0.87) correlation. This determines how closely related they will appear on a diagram and in which trend.

which(cor_ == max(cor_, na.rm = T), arr.ind = T)
##        row col
## saaH2O   8   2
which(cor_ == min(cor_, na.rm = T), arr.ind = T)
##           row col
## polar.req  10   7

Plot 0: Scatterplot matrix

This code sets up the specific details of the scatterplot matrix featuring the plots on the lower left triangle and the correlations on the upper right. The diagonal contains the names of the variables I chose to analyze.

panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor,...) 
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1] 
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r) 
}

Now we plot the data for the specific subset of variables, MW.da, vol, pol, H2Opho.35, polar.req, saaH2O, and faal.fold.

plot(aa_dat[, c("MW.da","vol","pol","H2Opho.35","polar.req", "saaH2O", "faal.fold")],
     panel = panel.smooth,
      upper.panel =panel.cor ,
     main = "Scatterplot Matrix")

Here we see that the strongest correlation exists between saaH2O and vol (0.99) and will be a positive linear relationship. We see the least correlated is MW.da and polar.req (-0.05), resulting in a non-linear relationship. The closer the correlation value is to zero, the more non-linear it is. The closer it is to one, the more linear it is. The sign of the correlation value determines the trend of the data, either positively sloping or negatively.

Plot 1: Replicating Higgs and Attwood Figure 2.8

We start describing the relationship between the volume from van der Waals radii of amino acids and their respective isoelectric points by creating a plot.

plot(vol  ~ isoelec, data = aa_dat, #create plot
     xlab = "Isoelectric Point (pI)",
     ylab = "Volume",
     main = "Amino Acids and pI",
     col = 0)
text(vol ~ isoelec, #plotting data points
     labels = aa, 
     data = aa_dat, 
     col = 1:20)

Figure 2: Figure 2.8 with ggpubr

Now we recreate the figure above but with ggpubr and use the variables H2Opho.35 and freq to describe color and shape.

ggscatter(y = "vol", x = "isoelec",
size = "H2Opho.35", color = "freq", data = aa_dat,
xlab = "Isoelectric Point (pI)",
ylab = "Volume")

Figure 3: Highly correlated data

The variables MW.da and vol have a positive correlation of 0.93, which is a strong linear relationship. We consider an ellipse, which describes the axis of variation and the overall trend of the data, and see that it’s thinly stretched, showing the strong, positively correlated data. We also consider a line of best fit by following the formula y = B0 + B1*x, where B0 represents the intercept and B1 represents the coefficient. If B1 is negative, the data will follow negative slope. The code below carries out those representations.

ggscatter(y = "vol",
          x = "MW.da",
          size = "H2Opho.35",
          add  = "reg.line", #line of best fit
          ellipse = TRUE, #ellipse
          cor.coef = TRUE, #correlation coefficient
          data = aa_dat,
          xlab = "Molecular weight",
          ylab = "Volume")
## `geom_smooth()` using formula 'y ~ x'

Figure 4: Non-linear relationship

As mentioned earlier, we see that MW.da and polar.req are negatively correlated with a coefficient close to zero (-0.05). This implies a non-linear relationship and we can show that using the code below. I also add a smoother, which is the blue line that helps see the lack of correlation.

ggscatter(y = "MW.da",
          x = "polar.req",
          data = aa_dat,
          size = "freq",
          color = "saaH2O",
          add = "loess", #adds smoother through ggpubr
          xlab = "Polar requirement",
          ylab = "Molecular weight")
## `geom_smooth()` using formula 'y ~ x'

Figure 5: Non-linear relationship on LOG scale

Next we replicate the figure above, but on the log scale. This ensures smaller values to be used to represent the data being plotted to better understand the relationship.

aa_dat$MW.da_log <- log(aa_dat$MW.da) #creating a new column with MW.da
aa_dat$polar.req_log <- log(aa_dat$polar.req) #creating a new column with polar.reg

ggscatter(y = "MW.da_log",
          x = "polar.req_log",
          data = aa_dat,
          size = "freq",
          color = "saaH2O",
          add  = "reg.line", #line of best fit
          xlab = "polar.req_log",
          ylab = "MW.da_log",
          )
## `geom_smooth()` using formula 'y ~ x'

Figure 6: 3D Scatterplot

Using the variables polar.req, H2Opho.35, and pol, we build a 3D scatterplot to see the strong correlations. H2Opho.35 and polar.req have a linear correlation of (-0.87), H2Opho.35 and pol have a linear correlation of (-0.85), and polar.req and pol have a linear correlation of (0.76).

scatterplot3d(x = aa_dat$polar.req,
              y = aa_dat$H2Opho.35,
              z = aa_dat$pol,
              highlight.3d = TRUE,
              type = "h")

Principal Component Analysis

PCA makes it easier to see trends of data by grouping them into subcategories. This reduces the overall number of dimensions, which is the key component in being able to analyze variable relationships more efficiently and all at once. The article states, “The PCA method transforms this cloud of points first by scaling them and shifting them to the origin, and then by rotating them in such a way that the points are spread out as much as possible, and the structure in the data is made easier to see” (Higgs and Attwood 2005, pg 26).

We’ve seen in class that PCA does a descent job at showing such similarities. Using a variety of packages such as ggpubr, ggplot2, and vegan, we are able to carry out PCA analysis on mammal body part sizes or amino acid properties. We’ve done it by using both prcomp() and rda().

Set-up data

We create a new dataframe excluding the categorical variables. This is done by explicitly naming the variables and creating a new subset from the original data set.

aa_dat2 <- aa_dat[,c("vol","bulk","pol","isoelec","H2Opho.34","H2Opho.35","saaH2O","faal.fold")]

Figure 7: Principal components analysis - base R

This creates a PCA plot using base R commands.

pca.out <- prcomp(aa_dat2, scale = TRUE)
biplot(pca.out)

Figure 8: Principal components analysis - vegan

This will create PCA on the new data and stores it in rda.out.

rda.out <- vegan::rda(aa_dat2, scale = TRUE) #variables scaled to make interval smaller 

This extracts the PCA scores and stores it in rda_scores.

rda_scores <- scores(rda.out)

Now we use base R to create a PCA plot and set the group variable to the categorical variable “charge” instead of “hydropathy”. We see that it looks different from figure 2.10 in the sense that the three circle sizes are different. Both variables hydropathy and charge feature three categories, but hydropathy is more encompassing than charge, which features a strong predominance of the uncharged amino acid chains.

biplot(rda.out, display = "sites", col = 0,)
orditorp(rda.out, display = "sites", labels = aa_dat$aa, cex = 1.2) 
ordihull(rda.out, group = aa_dat$charge, col = 1:7, lty = 1:7, lwd = c(3,6))

Cluster analysis

Cluster analysis is important for seeing similarities and dissimilarities between different variables. The algorithm groups certain variables together which are similar so that we can see differentiated groupings. In this script, we will explore the data by using UPGMA.

Hierarchial cluster analysis

UPGMA (unweighted pair group method with arithmetic mean) is a clustering algorithm that can be used to create phylogenetic trees. It assumes the molecular clock hypothesis in the sense that it considers the rate of evolution as constant. It uses a multiple sequence alignment to create a distance matrix of the taxa being analyzed.

A distance matrix is a matrix of pairwise distances between sequences. They can be used to create either rooted or unrooted trees via different algorithms (UPGMA, WPGMA). It can also use euclidean distance as a common metric generalized by the Pythagorean Theorem. This is just a way to measure the distance between a set of points.

The code below creates a dendrogram of the data.

row.names(aa_dat2) = paste(aa_dat$aa, sep = ".")

dist_euc <- dist(aa_dat2, method = "euclidean") #calculates euclidean distance
clust_euc <- hclust(dist_euc, method = "average")  #average for UPGMA algorithm                   
dendro_euc <- as.dendrogram(clust_euc) #convert to dendrogram

plot(dendro_euc, horiz = T, nodePar = list(pch = c(1,NA), cex = 0.5, lab.cex = 0.5))

Here, we see a very similar dendrogram to the one produced in the Higgs and Attwood paper. In my diagram, there are two branches coming from the ancestral branch and it appears to create six clusters. The paper also appears to notice six similar groups of basic residues, acidic and amide, small, cysteine, hydrophobic, and aromatic residues.