This script replicates Higg’s and Attwood’s analysis on amino acids. Here, we will explore amino acids and their functionsin protein folding through bioinformatics in R, specifically using scatterplots, Principal Component Analysis (PCA), and Hierarchical Cluster Analysis using the vegan package.
Packages
#a package used for data visualization
library(ggplot2)
#allows for 'ggplot2' customization; a "wrapper" for 'ggplot2' package
library(ggpubr)
#a set of restricted permtation designes
library(permute)
#an add on package for data visualization
library(lattice)
#a package commonly used by community ecologists for multivarate analysis
library(vegan)
## This is vegan 2.5-6
#used to make 3-D scatterplots
library(scatterplot3d)
ggplot2 is a package used for data visualization, and ggpubr is a “wrapper” that allows for even more functions to the ggplot2 package
These data are derived mostly from Higgs (2009) Table 1. They are similar to Higgs and Attwood 2005 but have extra columns. Nathan Brouwer added additional categorical varibles based on online information. The data here display characteristics of amino acids.
# 1 letter code
aa <-c('A','C','D','E','F','G','H','I','K','L','M','N',
'P','Q','R','S','T','V','W','Y')
## molecular weight in dalton
MW.da <-c(89,121,133,146,165,75,155,131,146,131,149,132,115,147,174,105,119,117,204,181)
## vol from van der Waals radii
vol <-c(67,86,91,109,135,48,118,124,135,124,124,96,90,
114,148,73,93,105,163,141)
## bulk – a measure of the shape of the side chain
bulk <-c(11.5,13.46,11.68,13.57,19.8,3.4,13.69,21.4,
15.71,21.4,16.25,12.28,17.43,
14.45,14.28,9.47,15.77,21.57,21.67,18.03)
## pol – a measure of the electric field strength around the molecule
pol <-c(0,1.48,49.7,49.9,0.35,0,51.6,0.13,49.5,0.13,
1.43,3.38,1.58,3.53,52,1.67,1.66,0.13,2.1,1.61)
## isoelec point
isoelec <-c(6,5.07,2.77,3.22,5.48,5.97,7.59,6.02,9.74,5.98,
5.74,5.41,6.3,5.65,10.76,5.68,6.16,5.96,5.89,5.66)
## 1st Hydrophobicity scale
H2Opho.34 <-c(1.8,2.5,-3.5,-3.5,2.8,-0.4,-3.2,4.5,-3.9,3.8,1.9,
-3.5,-1.6,-3.5,-4.5,-0.8,-0.7,4.2,-0.9,-1.3)
## 2nd Hydrophobicity scale
H2Opho.35 <-c(1.6,2,-9.2,-8.2,3.7,1,-3,3.1,-8.8,2.8,3.4,-4.8,
-0.2,-4.1,-12.3,0.6,1.2,2.6,1.9,-0.7)
## Surface area accessible to water in an unfolded peptide
saaH2O <-c(113,140,151,183,218,85,194,182,211,180,204,158,
143,189,241,122,146,160,259,229)
## Fraction of accessible area lost when a protein folds
faal.fold <-c(0.74,0.91,0.62,0.62,0.88,0.72,0.78,0.88,0.52,
0.85,0.85,0.63,0.64,0.62,0.64,0.66,0.7,0.86,0.85,0.76)
# Polar requirement
polar.req <-c(7,4.8,13,12.5,5,7.9,8.4,4.9,10.1,4.9,5.3,10,
6.6,8.6,9.1,7.5,6.6,5.6,5.2,5.4)
## relative frequency of occurance
## "The frequencies column shows the mean
### percentage of each amino acid in the protein sequences
### of modern organisms"
freq <-c(7.8,1.1,5.19,6.72,4.39,6.77,2.03,6.95,6.32,
10.15,2.28,4.37,4.26,3.45,5.23,6.46,5.12,7.01,1.09,3.3)
# charges
## un = Un-charged
## neg = negative
## pos = positive
charge<-c('un','un','neg','neg','un','un','pos','un','pos','un','un','un','un','un','pos','un','un','un','un','un')
## hydropathy
hydropathy<-c('hydrophobic','hydrophobic','hydrophilic','
hydrophilic','hydrophobic','neutral','neutral',
'hydrophobic','hydrophilic','hydrophobic','hydrophobic',
'hydrophilic','neutral',
'hydrophilic','hydrophilic','neutral','neutral',
'hydrophobic','hydrophobic','neutral')
## vol
vol.cat<-c('verysmall','small','small','medium',
'verylarge','verysmall','medium','large','large',
'large','large','small','small','medium','large',
'verysmall',
'small','medium','verylarge','verylarge')
## pol
pol.cat<-c('nonpolar','nonpolar','polar','polar',
'nonpolar','nonpolar','polar','nonpolar',
'polar','nonpolar','nonpolar','polar','nonpolar','polar',
'polar','polar','polar','nonpolar','nonpolar','polar')
## chemical
chemical<-c('aliphatic','sulfur','acidic','acidic','aromatic',
'aliphatic','basic','aliphatic','basic','aliphatic','sulfur',
'amide','aliphatic','amide','basic','hydroxyl','hydroxyl',
'aliphatic','aromatic','aromatic')
log_vol = log(vol)
log_polar_req = log(polar.req)
aa_dat <- data.frame(aa,MW.da,vol,bulk,
pol,isoelec,H2Opho.34,H2Opho.35,
saaH2O,faal.fold, polar.req,freq,charge,hydropathy,vol.cat,pol.cat,chemical, log_vol, log_polar_req)
aa_dat
## aa MW.da vol bulk pol isoelec H2Opho.34 H2Opho.35 saaH2O faal.fold
## 1 A 89 67 11.50 0.00 6.00 1.8 1.6 113 0.74
## 2 C 121 86 13.46 1.48 5.07 2.5 2.0 140 0.91
## 3 D 133 91 11.68 49.70 2.77 -3.5 -9.2 151 0.62
## 4 E 146 109 13.57 49.90 3.22 -3.5 -8.2 183 0.62
## 5 F 165 135 19.80 0.35 5.48 2.8 3.7 218 0.88
## 6 G 75 48 3.40 0.00 5.97 -0.4 1.0 85 0.72
## 7 H 155 118 13.69 51.60 7.59 -3.2 -3.0 194 0.78
## 8 I 131 124 21.40 0.13 6.02 4.5 3.1 182 0.88
## 9 K 146 135 15.71 49.50 9.74 -3.9 -8.8 211 0.52
## 10 L 131 124 21.40 0.13 5.98 3.8 2.8 180 0.85
## 11 M 149 124 16.25 1.43 5.74 1.9 3.4 204 0.85
## 12 N 132 96 12.28 3.38 5.41 -3.5 -4.8 158 0.63
## 13 P 115 90 17.43 1.58 6.30 -1.6 -0.2 143 0.64
## 14 Q 147 114 14.45 3.53 5.65 -3.5 -4.1 189 0.62
## 15 R 174 148 14.28 52.00 10.76 -4.5 -12.3 241 0.64
## 16 S 105 73 9.47 1.67 5.68 -0.8 0.6 122 0.66
## 17 T 119 93 15.77 1.66 6.16 -0.7 1.2 146 0.70
## 18 V 117 105 21.57 0.13 5.96 4.2 2.6 160 0.86
## 19 W 204 163 21.67 2.10 5.89 -0.9 1.9 259 0.85
## 20 Y 181 141 18.03 1.61 5.66 -1.3 -0.7 229 0.76
## polar.req freq charge hydropathy vol.cat pol.cat chemical log_vol
## 1 7.0 7.80 un hydrophobic verysmall nonpolar aliphatic 4.204693
## 2 4.8 1.10 un hydrophobic small nonpolar sulfur 4.454347
## 3 13.0 5.19 neg hydrophilic small polar acidic 4.510860
## 4 12.5 6.72 neg \nhydrophilic medium polar acidic 4.691348
## 5 5.0 4.39 un hydrophobic verylarge nonpolar aromatic 4.905275
## 6 7.9 6.77 un neutral verysmall nonpolar aliphatic 3.871201
## 7 8.4 2.03 pos neutral medium polar basic 4.770685
## 8 4.9 6.95 un hydrophobic large nonpolar aliphatic 4.820282
## 9 10.1 6.32 pos hydrophilic large polar basic 4.905275
## 10 4.9 10.15 un hydrophobic large nonpolar aliphatic 4.820282
## 11 5.3 2.28 un hydrophobic large nonpolar sulfur 4.820282
## 12 10.0 4.37 un hydrophilic small polar amide 4.564348
## 13 6.6 4.26 un neutral small nonpolar aliphatic 4.499810
## 14 8.6 3.45 un hydrophilic medium polar amide 4.736198
## 15 9.1 5.23 pos hydrophilic large polar basic 4.997212
## 16 7.5 6.46 un neutral verysmall polar hydroxyl 4.290459
## 17 6.6 5.12 un neutral small polar hydroxyl 4.532599
## 18 5.6 7.01 un hydrophobic medium nonpolar aliphatic 4.653960
## 19 5.2 1.09 un hydrophobic verylarge nonpolar aromatic 5.093750
## 20 5.4 3.30 un neutral verylarge polar aromatic 4.948760
## log_polar_req
## 1 1.945910
## 2 1.568616
## 3 2.564949
## 4 2.525729
## 5 1.609438
## 6 2.066863
## 7 2.128232
## 8 1.589235
## 9 2.312535
## 10 1.589235
## 11 1.667707
## 12 2.302585
## 13 1.887070
## 14 2.151762
## 15 2.208274
## 16 2.014903
## 17 1.887070
## 18 1.722767
## 19 1.648659
## 20 1.686399
Raw data exploration Correlation Matrix
cor_ <- round(cor(aa_dat[,-c(1,13:17)]),2)
diag(cor_) <- NA
cor_[upper.tri(cor_)] <- NA
cor_
## MW.da vol bulk pol isoelec H2Opho.34 H2Opho.35 saaH2O
## MW.da NA NA NA NA NA NA NA NA
## vol 0.93 NA NA NA NA NA NA NA
## bulk 0.55 0.73 NA NA NA NA NA NA
## pol 0.29 0.24 -0.20 NA NA NA NA NA
## isoelec 0.20 0.36 0.08 0.27 NA NA NA NA
## H2Opho.34 -0.27 -0.08 0.44 -0.67 -0.20 NA NA NA
## H2Opho.35 -0.25 -0.16 0.32 -0.85 -0.26 0.85 NA NA
## saaH2O 0.97 0.99 0.64 0.29 0.35 -0.18 -0.23 NA
## faal.fold 0.11 0.18 0.49 -0.53 -0.18 0.84 0.79 0.12
## polar.req -0.05 -0.19 -0.53 0.76 -0.11 -0.79 -0.87 -0.11
## freq -0.52 -0.30 -0.04 -0.01 0.02 0.26 -0.02 -0.38
## log_vol 0.91 0.98 0.77 0.25 0.31 -0.07 -0.17 0.96
## log_polar_req -0.09 -0.22 -0.58 0.74 -0.03 -0.82 -0.86 -0.15
## faal.fold polar.req freq log_vol log_polar_req
## MW.da NA NA NA NA NA
## vol NA NA NA NA NA
## bulk NA NA NA NA NA
## pol NA NA NA NA NA
## isoelec NA NA NA NA NA
## H2Opho.34 NA NA NA NA NA
## H2Opho.35 NA NA NA NA NA
## saaH2O NA NA NA NA NA
## faal.fold NA NA NA NA NA
## polar.req -0.81 NA NA NA NA
## freq -0.18 0.14 NA NA NA
## log_vol 0.17 -0.16 -0.29 NA NA
## log_polar_req -0.86 0.99 0.14 -0.2 NA
The code below uses the which() function to determine the maximum correlation value in the data frame and removes any NA values in the calculations. The most positive correlation is the saaH2O and volume, while the least is the polar.req and hydrophobe.35.
which(cor_ == max(cor_, na.rm = T), arr.ind = T)
## row col
## saaH2O 8 2
## log_polar_req 13 10
which(cor_ == min(cor_, na.rm = T), arr.ind = T)
## row col
## polar.req 10 7
Plot 0: Scatterplot matrix
pairs(aa_dat[2:8])
The function below creates lines of best fit in the scatterplot correlation matrix to show relatedness of the variables.
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor,...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}
On the plots of the six variables plotted, MW.da vs. vol have a linear relationship. The other variables have no clear relationship. H2Opho.34 and polar.req have a linear relationship.
Plot 1: Replication Higgs and Attwood Figure 2.8
THe plot indicates the correlation between amino acid volume and polarity, which is coded for by the main plotting.
Correlation scatterplot matrix
plot(polar.req ~ freq, data = aa_dat,
xlab = "Polar requirement",
ylab = "Volume (A^3)",
main = "Amino Acid vol vs. pI",
col = 0)
text(polar.req ~ freq,
labels = aa,
data = aa_dat,
col = 1:20)
Figure 2: HA Figure 2.8 with ggpubr
ggscatter(y = "polar.req",
x = "freq",
size = "polar.req",
color = "polar.req",
data = aa_dat,
xlab = "freq",
ylab = "polar.req")
Figure 3: Highly correlated data Plot 2 variables with positive correlation using ggscatter to plot volume and MW.da.
ggscatter(y = "vol",
x = "MW.da",
add = "reg.line", #Line of best fit
ellipse = TRUE, #Data ellipse
cor.coef = TRUE, #Correlation coefficient
data = aa_dat,
xlab = "MW.da",
ylab = "Volume")
## `geom_smooth()` using formula 'y ~ x'
The line of best fit has an equation of N = B0 +B1T, and the ellipse represents the standard deviation.
Figure 4: Non-linear relationship Using a non-linear relationship to represent the data, we will use a loess smoother to smooth the line.
ggscatter(data = aa_dat,
y = "vol",
x = "polar.req",
size = "MW.da",
color = "pol",
add = "loess") +
xlab("polar requirment") +
ylab("Volume")
## `geom_smooth()` using formula 'y ~ x'
Figure 5: Non-linear relationship on LOG scale Using a log scale, the pI and volume correlation can be examined to determine the correlation. The log scale makes the data more linear for the volume and pI data.
ggscatter(data = aa_dat,
y = "log_vol",
x = "log_polar_req",
size = "bulk",
color = "freq",
add = "reg.line") +
xlab("log(pI)") +
ylab("log(Volume)")
## `geom_smooth()` using formula 'y ~ x'
Figure 6: 3D Scatterplot The data doesn’t indicate much correlation, as there aren’t cluster in the data.
scatterplot3d(x = aa_dat$polar.req,
y = aa_dat$MW.da,
z = aa_dat$svol,
highlight.3d = TRUE,
angle = 60,
type="h")
PCA, or principal component analysis,is a commonly used analysis method in computational biology. According to the Higgs paper, a PCA is a way of “visualizing the important features of multidimensional data sets, such as tables of amino acid properties. PCA chooses coordinates that are linear combinations of the original variables in such a way that the maximum variability between the data points is explained by the first few coordinates” (Higgs and Attwood 2005, pg 25). The two basic ways we make PCA plots in R are through base R and through the vegan package. The rda() function in vegan should be used.
Data Prep The data must be prepped to remove categorical variables. A new data frame will be made of just numerical values. Only 8 of the 11 values will be used to align with the Higgs and Attwood table.
aa_dat2 = data.frame(vol, bulk, pol, isoelec,H2Opho.34, H2Opho.35, saaH2O, faal.fold)
Figure 7: Principal components analysis - base R PCA plot with base R command
pca.out <- prcomp(aa_dat2[,-c(7,8)], scale = TRUE)
biplot(pca.out)
Figure 8: Principal component analysis - vegan PCA using vegan; for another example see https://rpubs.com/brouwern/veganpca
rda.out <- vegan::rda(aa_dat2, scale = TRUE)
Extract the known PCA score
rda_scores <- scores(rda.out)
Making a “biplot,” sometimes referred to as a PCA plot
biplot(rda.out, display = "sites", col = 0,)
orditorp(rda.out, display = "sites", labels = aa_dat$aa, cex = 1.2)
ordihull(rda.out,
group = aa_dat$hydropathy,
col = 1:7,
lty = 1:7,
lwd = c(3,6))
This figure is different from the one in Figure 2.10, page 28, of Higgs and Attwood. One difference, for instance, is that this plot has groupings while the Higgs and Attwood plot does not.
Cluster Analysis Heirarchical Cluster Analysis
Cluster analysis is used to break down a large data set into useful subgroups. One type of cluster analysis is UPGMA, which is more specifically a hierarchical cluster analysis.
Here, are the row names for the UPGMA.
row.names(aa_dat2) <- paste(aa_dat$aa,1:nrow(aa_dat2),sep = ".")
row.names(aa_dat2)
## [1] "A.1" "C.2" "D.3" "E.4" "F.5" "G.6" "H.7" "I.8" "K.9" "L.10"
## [11] "M.11" "N.12" "P.13" "Q.14" "R.15" "S.16" "T.17" "V.18" "W.19" "Y.20"
A distance matrix can be made with the function dist() using the euclidean algorithm (method = “euclidian”). The distance matrix creates a two dimensional matrix, where the distances can be determined to represent differences or similarities between two points.
dist_euc <- dist(aa_dat2[,-9],
method = "euclidean")
A cluster analysis can be done with hclust() function with the UPGMA algorithm.
clust_euc <- hclust(dist_euc, method = "average")
#setting the method to "average" makes the hclust() function perform a UPGMA cluster analysis. Hclust() defaults to a complete cluster analysis.
Plotting a dendrogram The cluster diagram compares to Higgs and Atwood’s in Plate 2.2 in that Higgs and Attwood included a heat map in their dendrogram and the clustering is different in their histogram compared to this one.
par(mar = c(1,1,1,1))
dendro_euc <- as.dendrogram(clust_euc)
plot(dendro_euc,horiz = T,nodePar = list(pch = c(1,NA),
cex = 0.5,
lab.cex = 0.5))