This code examines properties of amino acids, such as volume, isoelectric point, polarity and others to find trends and correlations. It first creates a matrix to examine the correlations between different variables. Then, it plots different variables against each other to see trends. Next, it performs a PCA to reduce the dimensionality of the data and visualize the correlations between all variables. Fianlly, through cluster analysis, we can use the data to find relationship and trends between properties of amino acids to offer explanations for their behaviors in nature. ## Preliminaries
ggpubr and ggplot2 are packages that make it more intuitive to create good looking graphs. ggpubr provides functions for creating and customizing ggplot2-based plots.
library(ggplot2) # creating plots
library(ggpubr) # creating plots
library(vegan) # Community Ecology data
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.5-6
library(scatterplot3d) # creating 3D scatter plots
These data represent different properties of amino acids. They include measurements of the size (mass, volume, etc), qualities and measurements of polarity, and measurements regarding hydrophobicity. There are both numerical and categorical data.
These data are derived mostly from Higgs (2009) Table 1. They are similar to Higgs and Attwood 2005 but have extra columns. Nathan Brouwer added additional categorical variables based on online information.
# 1 letter code
aa <-c('A','C','D','E','F','G','H','I','K','L','M','N',
'P','Q','R','S','T','V','W','Y')
## molecular weight in dalton
MW.da <-c(89,121,133,146,165,75,155,131,146,131,149,132,115,147,174,105,119,117,204,181)
## vol from van der Waals radii
vol <-c(67,86,91,109,135,48,118,124,135,124,124,96,90,
114,148,73,93,105,163,141)
## bulk – a measure of the shape of the side chain
bulk <-c(11.5,13.46,11.68,13.57,19.8,3.4,13.69,21.4,
15.71,21.4,16.25,12.28,17.43,
14.45,14.28,9.47,15.77,21.57,21.67,18.03)
## pol – a measure of the electric field strength around the molecule
pol <-c(0,1.48,49.7,49.9,0.35,0,51.6,0.13,49.5,0.13,
1.43,3.38,1.58,3.53,52,1.67,1.66,0.13,2.1,1.61)
## isoelec point
isoelec <-c(6,5.07,2.77,3.22,5.48,5.97,7.59,6.02,9.74,5.98,
5.74,5.41,6.3,5.65,10.76,5.68,6.16,5.96,5.89,5.66)
## 1st Hydrophobicity scale
H2Opho.34 <-c(1.8,2.5,-3.5,-3.5,2.8,-0.4,-3.2,4.5,-3.9,3.8,1.9,
-3.5,-1.6,-3.5,-4.5,-0.8,-0.7,4.2,-0.9,-1.3)
## 2nd Hydrophobicity scale
H2Opho.35 <-c(1.6,2,-9.2,-8.2,3.7,1,-3,3.1,-8.8,2.8,3.4,-4.8,
-0.2,-4.1,-12.3,0.6,1.2,2.6,1.9,-0.7)
## Surface area accessible to water in an unfolded peptide
saaH2O <-c(113,140,151,183,218,85,194,182,211,180,204,158,
143,189,241,122,146,160,259,229)
## Fraction of accessible area lost when a protein folds
faal.fold <-c(0.74,0.91,0.62,0.62,0.88,0.72,0.78,0.88,0.52,
0.85,0.85,0.63,0.64,0.62,0.64,0.66,0.7,0.86,0.85,0.76)
# Polar requirement
polar.req <-c(7,4.8,13,12.5,5,7.9,8.4,4.9,10.1,4.9,5.3,10,
6.6,8.6,9.1,7.5,6.6,5.6,5.2,5.4)
## relative frequency of occurrence
## "The frequencies column shows the mean
### percentage of each amino acid in the protein sequences
### of modern organisms"
freq <-c(7.8,1.1,5.19,6.72,4.39,6.77,2.03,6.95,6.32,
10.15,2.28,4.37,4.26,3.45,5.23,6.46,5.12,7.01,1.09,3.3)
##
# charges
## un = Un-charged
## neg = negative
## pos = positive
charge<-c('un','un','neg','neg','un','un','pos','un','pos',
'un','un','un','un','un','pos','un','un','un','un','un')
## hydropathy
hydropathy<-c('hydrophobic','hydrophobic','hydrophilic','
hydrophilic','hydrophobic','neutral','neutral',
'hydrophobic','hydrophilic','hydrophobic','hydrophobic',
'hydrophilic','neutral',
'hydrophilic','hydrophilic','neutral','neutral',
'hydrophobic','hydrophobic','neutral')
## vol
vol.cat<-c('verysmall','small','small','medium',
'verylarge','verysmall','medium','large','large',
'large','large','small','small','medium','large',
'verysmall',
'small','medium','verylarge','verylarge')
## pol
pol.cat<-c('nonpolar','nonpolar','polar','polar',
'nonpolar','nonpolar','polar','nonpolar',
'polar','nonpolar','nonpolar','polar','nonpolar','polar',
'polar','polar','polar','nonpolar','nonpolar','polar')
## chemical
chemical<-c('aliphatic','sulfur','acidic','acidic','aromatic',
'aliphatic','basic','aliphatic','basic','aliphatic','sulfur',
'amide','aliphatic','amide','basic','hydroxyl','hydroxyl',
'aliphatic','aromatic','aromatic')
Building data frame
aa_dat <- data.frame(aa,MW.da,vol,bulk,
pol,isoelec,H2Opho.34,H2Opho.35,
saaH2O,faal.fold, polar.req,freq,charge,hydropathy,vol.cat,pol.cat,chemical)
cor_ <- round(cor(aa_dat[,-c(1,13:17)]),2)
diag(cor_) <- NA
cor_[upper.tri(cor_)] <- NA
cor_
## MW.da vol bulk pol isoelec H2Opho.34 H2Opho.35 saaH2O faal.fold
## MW.da NA NA NA NA NA NA NA NA NA
## vol 0.93 NA NA NA NA NA NA NA NA
## bulk 0.55 0.73 NA NA NA NA NA NA NA
## pol 0.29 0.24 -0.20 NA NA NA NA NA NA
## isoelec 0.20 0.36 0.08 0.27 NA NA NA NA NA
## H2Opho.34 -0.27 -0.08 0.44 -0.67 -0.20 NA NA NA NA
## H2Opho.35 -0.25 -0.16 0.32 -0.85 -0.26 0.85 NA NA NA
## saaH2O 0.97 0.99 0.64 0.29 0.35 -0.18 -0.23 NA NA
## faal.fold 0.11 0.18 0.49 -0.53 -0.18 0.84 0.79 0.12 NA
## polar.req -0.05 -0.19 -0.53 0.76 -0.11 -0.79 -0.87 -0.11 -0.81
## freq -0.52 -0.30 -0.04 -0.01 0.02 0.26 -0.02 -0.38 -0.18
## polar.req freq
## MW.da NA NA
## vol NA NA
## bulk NA NA
## pol NA NA
## isoelec NA NA
## H2Opho.34 NA NA
## H2Opho.35 NA NA
## saaH2O NA NA
## faal.fold NA NA
## polar.req NA NA
## freq 0.14 NA
This code returns variables with the most positive and negative correlations and their locations in the matrix. saaH2O and volume have the most positive correlation, and polar.req and hydrophobe.35 have most negative correlation.
which(cor_ == max(cor_, na.rm = T), arr.ind = T)
## row col
## saaH2O 8 2
This code builds and adds each scatterplot and correlation coefficient to the matrix.
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor,...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}
plot(~MW.da+vol+pol+H2Opho.34+polar.req+isoelec,
panel = panel.smooth,
upper.panel = panel.cor)
Strong correlations exist between MW.da and vol, polar.req and H2Oph.34, and pol and polar.req. There are weak correlations between vol and H2Opho.34, MW.da and polar.req, and isoelec and MW.da. There is a nonlinear correlation between MW.da and H2Opho.34.
This is a plot of amino acid volume vs. pI which are thought to be important in protein folding, with the different points representing amino acids.
plot(vol ~ isoelec, data = aa_dat, # main plotting function
xlab = "pI",
ylab = "Volume (Å)",
main = "Amino Acid pI vs Volume",
col = 0)
text(vol ~ isoelec, # adds text label
labels = aa,
data = aa_dat,
col = 1:20)
ggscatter(y = "vol", x = "isoelec",
size = "bulk",
color = "pol",
data = aa_dat,
xlab = "pI",
ylab = "Volume")
ggscatter(y = "MW.da",
x = "vol",
size = "saaH2O",
color = "faal.fold",
data = aa_dat,
xlab = "Volume",
ylab = "Molecular Weight (daltons)",
add = "loess")
## `geom_smooth()` using formula 'y ~ x'
The log transformation transform skewed data to approximately conform to normality by taking the log of the values of the variables.
MW.da_log <- log(MW.da)
vol_log <- log(vol)
aa_dat_log <- data.frame(MW.da_log,vol_log,saaH2O,faal.fold)
ggscatter(y = "MW.da_log",
x = "vol_log",
size = "saaH2O",
color = "faal.fold",
data = aa_dat_log,
xlab = "Volume",
ylab = "Molecular Weight (daltons)",
cor.coef = TRUE)
There appears to be a positive correlation between volume and MW, surface area and MW, and surface area and volume. There does not appear to be a very clear correlation between faal.fold and the other variables.
scatterplot3d(x = saaH2O,
y = vol,
z = H2Opho.34,
highlight.3d = TRUE,
type="h")
Principal component analysis (PCA) is a tool to reduce the dimensions of data, and to visualize the similarities between things while filtering out noise. This is done by transforming the data points “first by scaling them and shifting them to the origin, and then by rotating them in such a way that the points are spread out as much as possible”(Higgs and Attwood 2005, pg 26).
The 2 basic ways we made PCA plots in this class are with the base R prcomp() and biplot() functions, which makes a 2D PCA plot. This is a simple, but not very flexible, way to make pca plots. Alternatively, we also use the rda() function in the R package vegan which does PCA and has other nice features not in the base R function.
This data frame drops the categorical variables from the original data frame, leaving only the numeric variables like in the original Higgs and Attwood table.
aa_dat2 <- data.frame(MW.da,vol,bulk,
pol,isoelec,H2Opho.34,H2Opho.35,
saaH2O,faal.fold, polar.req,freq)
pca.out <- prcomp(aa_dat2, scale = TRUE)
biplot(pca.out)
scale = TRUE scales species to unit variance (like correlations).
rda.out <- vegan::rda(aa_dat2, scale = TRUE)
rda_scores <- scores(rda.out)
TODO: The plot in Higgs and Atwood has similar groupings such as F with M, V, L, and I, and R with K. However, this plot is slightly more compressed as it is scaled differently. Also, the clustering of categorical polarity data (polar vs nonpolar) is represented in this plot, showing some overlap as in Higgs and Atwood, but a fairly clear separation.
biplot(rda.out, display = "sites", col = 0,)
orditorp(rda.out, display = "sites", labels = aa_dat$aa, cex = 1.2)
ordihull(rda.out,
group = aa_dat$pol.cat, col = 1:7,
lty = 1:7,
lwd = c(3,6))
Cluster analysis is used to place data points into groups or clusters, based on the data.The objects in a cluster tend to be similar to each other in some sense, and objects in different clusters tend to be dissimilar.
To build a UPGMA tree, you first construct a distance matrix and cluster a pair taxa by shortest distance. Then, recalculate a new average distance with the new cluster and other taxa, and make a new distance matrix. Repeat this until only two clusters remain.
A distance matrix is a two-dimensional array filled with the distances between the elements of a set. , The distances are calculated pairwise.
Euclidean distance is between two points in Euclidean space as a number, the length of a line segment between the two points. It can be calculated from the Cartesian coordinates of the points using the Pythagorean theorem.
Compared to the cluster diagram produced by Higgs and Atwood (2005) in Plate 2.2, the one produced here is similar in that it pairs I with L, P with T, and a few other equal pairings. It differs with many of the pairings however. Higgs and Atwood used CLUTO to construct the cluster analysis, as well as implementing a heat map with a second tree of the variables.
row.names(aa_dat2) <- paste(aa_dat$aa,1:nrow(aa_dat),sep = ".")
dist_euc <- dist(aa_dat2,
method = "euclidean")
clust_euc <- hclust(dist_euc,
method = "average") # specifies UPGMA method
plot(clust_euc)