We attempt to replicate Higgs and Attwood’s analysis on various properties of amino acids on a different set of data (Higgs 2009). We build data frames based on the data, and from there visualize our data in different ways in order to find patterns and relationships. We start with making a correlation matrix which allows us to see the strongest correlated and weakest correlated variables. We then visualize the variables that appear to have a linear relationship and ones that don’t to examine their relationship more closely. Finally, we conduct principal component analysis and cluster analysis on the amino acid data.
Learning how different properties of amino acids are related is biologically important because it is thought to be important for determining how the resulting protein might fold, allowing us to predict protein structure based on amino acid sequences and etcetra.
## Loading required package: permute
library(permute)
## Loading required package: Lattice
library(lattice)
## This is vegan 2.5-6
library(vegan)
## This is vegan 2.5-6
These data are derived mostly from Higgs (2009) Table 1. They are similar to Higgs and Attwood 2005 but have extra columns. Nathan Brouwer added additional categorical varibles based on online information. Make the vectors. Data represents different properties of the 20 amino acids.
# 1 letter code
aa <-c('A','C','D','E','F','G','H','I','K','L','M','N',
'P','Q','R','S','T','V','W','Y')
## molecular weight in dalton
MW.da <-c(89,121,133,146,165,75,155,131,146,131,149,132,115,147,174,105,119,117,204,181)
## vol from van der Waals radii
vol <-c(67,86,91,109,135,48,118,124,135,124,124,96,90,
114,148,73,93,105,163,141)
## bulk – a measure of the shape of the side chain
bulk <-c(11.5,13.46,11.68,13.57,19.8,3.4,13.69,21.4,
15.71,21.4,16.25,12.28,17.43,
14.45,14.28,9.47,15.77,21.57,21.67,18.03)
## pol – a measure of the electric field strength around the molecule
pol <-c(0,1.48,49.7,49.9,0.35,0,51.6,0.13,49.5,0.13,
1.43,3.38,1.58,3.53,52,1.67,1.66,0.13,2.1,1.61)
## isoelec point
isoelec <-c(6,5.07,2.77,3.22,5.48,5.97,7.59,6.02,9.74,5.98,
5.74,5.41,6.3,5.65,10.76,5.68,6.16,5.96,5.89,5.66)
## 1st Hydrophobicity scale
H2Opho.34 <-c(1.8,2.5,-3.5,-3.5,2.8,-0.4,-3.2,4.5,-3.9,3.8,1.9,
-3.5,-1.6,-3.5,-4.5,-0.8,-0.7,4.2,-0.9,-1.3)
## 2nd Hydrophobicity scale
H2Opho.35 <-c(1.6,2,-9.2,-8.2,3.7,1,-3,3.1,-8.8,2.8,3.4,-4.8,
-0.2,-4.1,-12.3,0.6,1.2,2.6,1.9,-0.7)
## Surface area accessible to water in an unfolded peptide
saaH2O <-c(113,140,151,183,218,85,194,182,211,180,204,158,
143,189,241,122,146,160,259,229)
## Fraction of accessible area lost when a protein folds
faal.fold <-c(0.74,0.91,0.62,0.62,0.88,0.72,0.78,0.88,0.52,
0.85,0.85,0.63,0.64,0.62,0.64,0.66,0.7,0.86,0.85,0.76)
# Polar requirement
polar.req <-c(7,4.8,13,12.5,5,7.9,8.4,4.9,10.1,4.9,5.3,10,
6.6,8.6,9.1,7.5,6.6,5.6,5.2,5.4)
## relative frequency of occurance
## "The frequencies column shows the mean
### percentage of each amino acid in the protein sequences
### of modern organisms"
freq <-c(7.8,1.1,5.19,6.72,4.39,6.77,2.03,6.95,6.32,
10.15,2.28,4.37,4.26,3.45,5.23,6.46,5.12,7.01,1.09,3.3)
##
# charges
## un = Un-charged
## neg = negative
## pos = positive
charge<-c('un','un','neg','neg','un','un','pos','un','pos',
'un','un','un','un','un','pos','un','un','un','un','un')
## hydropathy
hydropathy<-c('hydrophobic','hydrophobic','hydrophilic','
hydrophilic','hydrophobic','neutral','neutral',
'hydrophobic','hydrophilic','hydrophobic','hydrophobic',
'hydrophilic','neutral',
'hydrophilic','hydrophilic','neutral','neutral',
'hydrophobic','hydrophobic','neutral')
## vol
vol.cat<-c('verysmall','small','small','medium',
'verylarge','verysmall','medium','large','large',
'large','large','small','small','medium','large',
'verysmall',
'small','medium','verylarge','verylarge')
## pol
pol.cat<-c('nonpolar','nonpolar','polar','polar',
'nonpolar','nonpolar','polar','nonpolar',
'polar','nonpolar','nonpolar','polar','nonpolar','polar',
'polar','polar','polar','nonpolar','nonpolar','polar')
## chemical
chemical<-c('aliphatic','sulfur','acidic','acidic','aromatic',
'aliphatic','basic','aliphatic','basic','aliphatic','sulfur',
'amide','aliphatic','amide','basic','hydroxyl','hydroxyl',
'aliphatic','aromatic','aromatic')
Build the data frames.
aa_dat <- data.frame(aa,MW.da,vol,bulk,pol,isoelec, H2Opho.34,
H2Opho.35, saaH2O, faal.fold, polar.req, freq,
charge, hydropathy, vol.cat, pol.cat, chemical)
cor_ <- round(cor(aa_dat[,-c(1,13:17)]),2)
diag(cor_) <- NA
cor_[upper.tri(cor_)] <- NA
cor_
## MW.da vol bulk pol isoelec H2Opho.34 H2Opho.35 saaH2O faal.fold
## MW.da NA NA NA NA NA NA NA NA NA
## vol 0.93 NA NA NA NA NA NA NA NA
## bulk 0.55 0.73 NA NA NA NA NA NA NA
## pol 0.29 0.24 -0.20 NA NA NA NA NA NA
## isoelec 0.20 0.36 0.08 0.27 NA NA NA NA NA
## H2Opho.34 -0.27 -0.08 0.44 -0.67 -0.20 NA NA NA NA
## H2Opho.35 -0.25 -0.16 0.32 -0.85 -0.26 0.85 NA NA NA
## saaH2O 0.97 0.99 0.64 0.29 0.35 -0.18 -0.23 NA NA
## faal.fold 0.11 0.18 0.49 -0.53 -0.18 0.84 0.79 0.12 NA
## polar.req -0.05 -0.19 -0.53 0.76 -0.11 -0.79 -0.87 -0.11 -0.81
## freq -0.52 -0.30 -0.04 -0.01 0.02 0.26 -0.02 -0.38 -0.18
## polar.req freq
## MW.da NA NA
## vol NA NA
## bulk NA NA
## pol NA NA
## isoelec NA NA
## H2Opho.34 NA NA
## H2Opho.35 NA NA
## saaH2O NA NA
## faal.fold NA NA
## polar.req NA NA
## freq 0.14 NA
Locate the indices where the maximum correlation (most positive) is found (not considering missing values).
which(cor_ == max(cor_, na.rm = T), arr.ind = T)
## row col
## saaH2O 8 2
cor_[which(cor_ == max(cor_, na.rm = T), arr.ind = T)]
## [1] 0.99
We found that the strongest positive correlation is between saaH2O and volume, with a correlation of 0.99 at row 8 column 2. Locate the indices where the minimum correlation (most negative) is found (not considering missing values)
which(cor_ == min(cor_, na.rm = T), arr.ind = T)
## row col
## polar.req 10 7
cor_[which(cor_ == min(cor_, na.rm = T), arr.ind = T)]
## [1] -0.87
The strongest negative correlation is between polar requirement and 2nd hydrophobicity scale with a correlation of -0.87 at row 10 column 7.
A function is created to help format the scatterplot matrix. Takes input of two variables that would be from the data frame we previous created.
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}
# building a sub-dataframe from the main data, including only the variables that we are considering
aa_matrix <- data.frame(MW.da, vol, pol, H2Opho.34, polar.req, saaH2O, faal.fold)
# plotting
plot(aa_matrix,upper.panel = panel.cor,
panel = panel.smooth,
main = "Scatterplot Matrix")
From the scatterplot matrix, we could see that these pairs of variables shows strong positive correlation: molecular weight and volume, molecular weight and surface area accessible by water, and volume and surface area accessible by water. All these pairs are relatively expected, as the larger an amino acid is, the more atoms it is likely to contain and the heavier its molecular weight would be. There also exists a ratio between the volume and surface area that is essentially linear positive in the scope we consider. On the other hand, surface area accessible by water seems to share non-linear relationship all other variables excluding MW and volume.
plot(vol ~ isoelec, #examining the relationship between these two variables
data = aa_dat, #the respective columns comes from our dataframe
xlab = "pI", #adds x label
ylab = "Volume(Å3)", #adds y label
main = "Amino acide volume vs. pI", #adds title
col = 0)
#adding text to a plot
text(vol ~ isoelec,
labels = aa, #using amino acid single letter abbreviations as label
data = aa_dat,
col = 1:20 #20 colors to be used for different texts
)
The plot visualizes the relationship between amino acid volume against isoelectric point through all 20 amino acids.
ggpubr::ggscatter(y = "polar.req",
x = "freq",
size = "polar.req",
color = "polar.req",
data = aa_dat,
xlab = "Frequency",
ylab = "Polar Requirement")
ggpubr::ggscatter(x = "faal.fold",
y = "pol",
size = "pol",
shape = 18,
color = "pol",
data = aa_dat,
ylab = "Polarity",
xlab = "Fraction of accessible area lost upon fold")
ggpubr::ggscatter(x = "saaH2O",
y = "polar.req",
data = aa_dat,
add = "loess", #adds a loess smoother
size = "pol",
color = "faal.fold",
xlab = "Surface area accessible by water",
ylab = "Polar requirement")
## `geom_smooth()` using formula 'y ~ x'
aa_dat$saaH2O_log <- log(aa_dat$saaH2O)
aa_dat$polar.req_log <- log(aa_dat$polar.req)
ggpubr::ggscatter(x = "saaH2O_log",
y = "polar.req_log",
data = aa_dat,
add = "reg.line", #adds a regression line
size = "pol",
color = "faal.fold",
xlab = "Log surface area accessible by water",
ylab = "Log polar requirement")
## `geom_smooth()` using formula 'y ~ x'
The log transformation is used to create allometric data from the non-linear relationship by re-scaling the two variables and making relationships linear.
library(scatterplot3d)
aa_3d <- scatterplot3d(x = aa_dat$H2Opho.34,
y = aa_dat$faal.fold,
z = aa_dat$polar.req,
xlab = "H2Opho.34",
ylab = "faal.fold",
zlab = "polar.req",
highlight.3d = T,
type = "h"
)
Polar requirement has a moderate negative correlation with both 1st hydrophobicity scale and fraction of area lost after folding. On the other hand, 1st hydrophobicity scale and fraction of area lost after folding shows positive correlation.
PCA is the process of reducing high dimensional data to lower dimensions (typically 2D) to examine part of the variance of multiple variables and makes such visualizations easier to interpret. As illustrated by Higgs and Attwood, the method transforms the cloud of data by “scaling them and shifting them to the origin, and then by rotating them in such a way that the points are spread out as much as possible” (Higgs and Attwood 2005, pg 26). The resulting graph shows different distance vectors for different variables, showing the covariance between each pair.
Two basic ways we made PCA plots are using base R and vegan. For both methods, we pass a matrix of NxP dimension where N is the number of objects in dataset and P is the number of properties we’re considering (Higgs and Attwood 2005), and the data set is scaled and shifted as described above.
Using the column index, we select only the first and third through tenth columns in original data set and make it a sub-data set.
aa_dat2 <- aa_dat[c(3:10)] # removing extra variables MW.da, polar.req and freq
row.names(aa_dat2) <- aa_dat$aa # assign row names
pca.out <- prcomp(aa_dat2, scale. = T)
biplot(pca.out)
Setting scale to true in the vegan function enables scaling of the variables.
rda.out <- vegan::rda(aa_dat2, scale = T)
rda_scores <- scores(rda.out)
biplot(rda.out, display = "sites", col = 0)
orditorp(rda.out, display = "sites", labels = aa_dat$aa, cex = 1.2)
ordihull(rda.out,
group = aa_dat$hydropathy,
col = 1:7,
lty = 1:7,
lwd = c(3,6))
Cluster analysis is another way to look at relationships in large multivariable data sets. It is essentially the clustering of similar data points in order to determine the patterns in the data set, then building a dendrogram based off the data.
UPGMA is one such cluster analysis method (specifically a hierarchical cluster analysis) where we start by taking the original data to create a distance matrix. In this case, we account for the pair-wise quantitative difference (Euclidean distance) between the 20 amino acids. Then starting from the most similar amino acids, we iteratively combine the nearest two clusters into a larger cluster until all groups are included.
A distance matrix is a matrix that shows pairwise difference for all the objects. In this case we’re looking at differences of the 20 amino acids across the 8 numeric data categories.
Euclidean distance is quite literally the distance between two points in an Euclidean space. Here we consider the geometric distance between the two clusters.
dist_euc <- dist(aa_dat2,
method = "euclidean")
clust_euc <- hclust(dist_euc,
method = "average") # setting method to average changes the agglomeration method to UPGMA
dendro_euc <- as.dendrogram(clust_euc)
plot(dendro_euc,horiz = T,
nodePar = list(pch = c(1,NA),
cex = 0.5,
lab.cex = 0.5))
The dendrogram produces a different grouping pattern or different topology as compared to the cluster diagram produced by Higgs and Attwood. In example, A and S are very closely related in the dentrogram, but in the PCA cluster diagram they are in different clusters (green and blue). In addition, A, C, V, I, L, M, F, W are included in one cluster in the PCA diagram, while A, S, G, D are considered more similar to each other. In addition, whereas in PCA diagram we see a general mapping of amino acids with more similar properties, in the dendrogram we see even more complex relationships between them.