In this R script, we look to replicate the work of Higgs and Atwood in analyzing the properties of the twenty different amino acids. These include numerical variables, such as the size and isoelectric points of the molecules, as well as categorical variables such as a given amino acid’s hydrophobicity. We then make extensive use of PCA and clustering algorithms to sort the amino acids which share such similar properties. Doing so will allow us to gain a better unerstanding of the mechanisms in which different amino acids function, a task which has wide implications for the treatment of diseases as well as drug discovery.
We begin by loading the necessary packages into R. Here, we note that the “permute” package helps generate data permutations which can be plotted, while the “lattice” package functions as a “high-level data visualization system”, which will aid us in the creation of some plots and graphs.
The “ggplot2” package is perhaps the preeminent data organizing/plotting package used in R - due to its complexity, we also run “ggpubr” as a wrapper, an accompanying package which simplifies its use.
library(ggplot2)
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.0.3
library(permute)
## Warning: package 'permute' was built under R version 4.0.3
library(lattice)
library(vegan)
## Warning: package 'vegan' was built under R version 4.0.3
## This is vegan 2.5-6
library(scatterplot3d)
## Warning: package 'scatterplot3d' was built under R version 4.0.3
We begin with the following vectors. Each vector listed describes a number of different classifications for each different attribute of the given amino acid sequence. We will use this information throughout the analysis.
# 1 letter code
aa <-c('A','C','D','E','F','G','H','I','K','L','M','N',
'P','Q','R','S','T','V','W','Y')
## molecular weight in dalton
MW.da <-c(89,121,133,146,165,75,155,131,146,131,149,132,115,147,174,105,119,117,204,181)
## vol from van der Waals radii
vol <-c(67,86,91,109,135,48,118,124,135,124,124,96,90,
114,148,73,93,105,163,141)
## bulk – a measure of the shape of the side chain
bulk <-c(11.5,13.46,11.68,13.57,19.8,3.4,13.69,21.4,
15.71,21.4,16.25,12.28,17.43,
14.45,14.28,9.47,15.77,21.57,21.67,18.03)
## pol – a measure of the electric field strength around the molecule
pol <-c(0,1.48,49.7,49.9,0.35,0,51.6,0.13,49.5,0.13,
1.43,3.38,1.58,3.53,52,1.67,1.66,0.13,2.1,1.61)
## isoelec point
isoelec <-c(6,5.07,2.77,3.22,5.48,5.97,7.59,6.02,9.74,5.98,
5.74,5.41,6.3,5.65,10.76,5.68,6.16,5.96,5.89,5.66)
## 1st Hydrophobicity scale
H2Opho.34 <-c(1.8,2.5,-3.5,-3.5,2.8,-0.4,-3.2,4.5,-3.9,3.8,1.9,
-3.5,-1.6,-3.5,-4.5,-0.8,-0.7,4.2,-0.9,-1.3)
## 2nd Hydrophobicity scale
H2Opho.35 <-c(1.6,2,-9.2,-8.2,3.7,1,-3,3.1,-8.8,2.8,3.4,-4.8,
-0.2,-4.1,-12.3,0.6,1.2,2.6,1.9,-0.7)
## Surface area accessible to water in an unfolded peptide
saaH2O <-c(113,140,151,183,218,85,194,182,211,180,204,158,
143,189,241,122,146,160,259,229)
## Fraction of accessible area lost when a protein folds
faal.fold <-c(0.74,0.91,0.62,0.62,0.88,0.72,0.78,0.88,0.52,
0.85,0.85,0.63,0.64,0.62,0.64,0.66,0.7,0.86,0.85,0.76)
# Polar requirement
polar.req <-c(7,4.8,13,12.5,5,7.9,8.4,4.9,10.1,4.9,5.3,10,
6.6,8.6,9.1,7.5,6.6,5.6,5.2,5.4)
## relative frequency of occurrence
## "The frequencies column shows the mean
### percentage of each amino acid in the protein sequences
### of modern organisms"
freq <-c(7.8,1.1,5.19,6.72,4.39,6.77,2.03,6.95,6.32,
10.15,2.28,4.37,4.26,3.45,5.23,6.46,5.12,7.01,1.09,3.3)
##
# charges
## un = Un-charged
## neg = negative
## pos = positive
charge<-c('un','un','neg','neg','un','un','pos','un','pos','un','un','un','un','un','pos','un','un','un','un','un')
## hydropathy
hydropathy<-c('hydrophobic','hydrophobic','hydrophilic','
hydrophilic','hydrophobic','neutral','neutral',
'hydrophobic','hydrophilic','hydrophobic','hydrophobic',
'hydrophilic','neutral',
'hydrophilic','hydrophilic','neutral','neutral',
'hydrophobic','hydrophobic','neutral')
## vol
vol.cat<-c('verysmall','small','small','medium',
'verylarge','verysmall','medium','large','large',
'large','large','small','small','medium','large',
'verysmall',
'small','medium','verylarge','verylarge')
## pol
pol.cat<-c('nonpolar','nonpolar','polar','polar',
'nonpolar','nonpolar','polar','nonpolar',
'polar','nonpolar','nonpolar','polar','nonpolar','polar',
'polar','polar','polar','nonpolar','nonpolar','polar')
## chemical
chemical<-c('aliphatic','sulfur','acidic','acidic','aromatic',
'aliphatic','basic','aliphatic','basic','aliphatic','sulfur',
'amide','aliphatic','amide','basic','hydroxyl','hydroxyl',
'aliphatic','aromatic','aromatic')
aa_dat <- data.frame(aa, MW.da, vol, bulk, pol, isoelec, H2Opho.34, H2Opho.35, saaH2O, faal.fold, polar.req, freq, charge, hydropathy, vol.cat, pol.cat, chemical)
aa_dat
## aa MW.da vol bulk pol isoelec H2Opho.34 H2Opho.35 saaH2O faal.fold
## 1 A 89 67 11.50 0.00 6.00 1.8 1.6 113 0.74
## 2 C 121 86 13.46 1.48 5.07 2.5 2.0 140 0.91
## 3 D 133 91 11.68 49.70 2.77 -3.5 -9.2 151 0.62
## 4 E 146 109 13.57 49.90 3.22 -3.5 -8.2 183 0.62
## 5 F 165 135 19.80 0.35 5.48 2.8 3.7 218 0.88
## 6 G 75 48 3.40 0.00 5.97 -0.4 1.0 85 0.72
## 7 H 155 118 13.69 51.60 7.59 -3.2 -3.0 194 0.78
## 8 I 131 124 21.40 0.13 6.02 4.5 3.1 182 0.88
## 9 K 146 135 15.71 49.50 9.74 -3.9 -8.8 211 0.52
## 10 L 131 124 21.40 0.13 5.98 3.8 2.8 180 0.85
## 11 M 149 124 16.25 1.43 5.74 1.9 3.4 204 0.85
## 12 N 132 96 12.28 3.38 5.41 -3.5 -4.8 158 0.63
## 13 P 115 90 17.43 1.58 6.30 -1.6 -0.2 143 0.64
## 14 Q 147 114 14.45 3.53 5.65 -3.5 -4.1 189 0.62
## 15 R 174 148 14.28 52.00 10.76 -4.5 -12.3 241 0.64
## 16 S 105 73 9.47 1.67 5.68 -0.8 0.6 122 0.66
## 17 T 119 93 15.77 1.66 6.16 -0.7 1.2 146 0.70
## 18 V 117 105 21.57 0.13 5.96 4.2 2.6 160 0.86
## 19 W 204 163 21.67 2.10 5.89 -0.9 1.9 259 0.85
## 20 Y 181 141 18.03 1.61 5.66 -1.3 -0.7 229 0.76
## polar.req freq charge hydropathy vol.cat pol.cat chemical
## 1 7.0 7.80 un hydrophobic verysmall nonpolar aliphatic
## 2 4.8 1.10 un hydrophobic small nonpolar sulfur
## 3 13.0 5.19 neg hydrophilic small polar acidic
## 4 12.5 6.72 neg \nhydrophilic medium polar acidic
## 5 5.0 4.39 un hydrophobic verylarge nonpolar aromatic
## 6 7.9 6.77 un neutral verysmall nonpolar aliphatic
## 7 8.4 2.03 pos neutral medium polar basic
## 8 4.9 6.95 un hydrophobic large nonpolar aliphatic
## 9 10.1 6.32 pos hydrophilic large polar basic
## 10 4.9 10.15 un hydrophobic large nonpolar aliphatic
## 11 5.3 2.28 un hydrophobic large nonpolar sulfur
## 12 10.0 4.37 un hydrophilic small polar amide
## 13 6.6 4.26 un neutral small nonpolar aliphatic
## 14 8.6 3.45 un hydrophilic medium polar amide
## 15 9.1 5.23 pos hydrophilic large polar basic
## 16 7.5 6.46 un neutral verysmall polar hydroxyl
## 17 6.6 5.12 un neutral small polar hydroxyl
## 18 5.6 7.01 un hydrophobic medium nonpolar aliphatic
## 19 5.2 1.09 un hydrophobic verylarge nonpolar aromatic
## 20 5.4 3.30 un neutral verylarge polar aromatic
Here, we will establish a correlation matrix.
cor_ <- round(cor(aa_dat[,-c(1,13:17)]),2)
diag(cor_) <- NA
cor_[upper.tri(cor_)] <- NA
cor_
## MW.da vol bulk pol isoelec H2Opho.34 H2Opho.35 saaH2O faal.fold
## MW.da NA NA NA NA NA NA NA NA NA
## vol 0.93 NA NA NA NA NA NA NA NA
## bulk 0.55 0.73 NA NA NA NA NA NA NA
## pol 0.29 0.24 -0.20 NA NA NA NA NA NA
## isoelec 0.20 0.36 0.08 0.27 NA NA NA NA NA
## H2Opho.34 -0.27 -0.08 0.44 -0.67 -0.20 NA NA NA NA
## H2Opho.35 -0.25 -0.16 0.32 -0.85 -0.26 0.85 NA NA NA
## saaH2O 0.97 0.99 0.64 0.29 0.35 -0.18 -0.23 NA NA
## faal.fold 0.11 0.18 0.49 -0.53 -0.18 0.84 0.79 0.12 NA
## polar.req -0.05 -0.19 -0.53 0.76 -0.11 -0.79 -0.87 -0.11 -0.81
## freq -0.52 -0.30 -0.04 -0.01 0.02 0.26 -0.02 -0.38 -0.18
## polar.req freq
## MW.da NA NA
## vol NA NA
## bulk NA NA
## pol NA NA
## isoelec NA NA
## H2Opho.34 NA NA
## H2Opho.35 NA NA
## saaH2O NA NA
## faal.fold NA NA
## polar.req NA NA
## freq 0.14 NA
which(cor_ == max(cor_, na.rm = T), arr.ind = T)
## row col
## saaH2O 8 2
which(cor_ == min(cor_, na.rm = T), arr.ind = T)
## row col
## polar.req 10 7
We observe that the above code highlights the maximum positive correlation between two variables in the matrix (in this case, between “saaH2O” and “vol” with an r-value of 0.99). By changing the “max” to “min”, we will obtain the maximum negative correlation between two variables, found between “polar.req” and “H2Opho.35”, with an r-value of -0.87.
To create our scatterplot matrix, which will more clearly display the strength of relationships between variables, we first run the code below. This will serve as a correlation coefficient between two data series.
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor,...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}
Now, we may generate our scatterplot matrix.
plot( ~ MW.da + vol + pol + H2Opho.34 + polar.req + H2Opho.35 + saaH2O,
data = na.omit(aa_dat[,c("MW.da","vol","pol","H2Opho.34","polar.req","H2Opho.35", "saaH2O")]),
panel = panel.smooth,
upper.panel =panel.cor )
We observe a high r-value of 0.99 between “vol” and “saaH2O”, these two variables clearly have the strongest correlation among those plotted. Conversely, “MW.da” and “polar.req” have the lowest degree of correlation - their r-value of -0.05 is very close to zero, indication very little to no correlation between the two. We also observe that some relationships, like that between “saaH2O” and “H2Opho.35”, appear to be nonlinear.
The following code generates a 2D graph displaying the relationship between the volume of an amino acid, and its isoelectric point, as first performed by Higgs and Atwood.
plot(vol ~ isoelec, data = aa_dat,
xlab = "pI",
ylab = "Volume",
main = "Volume vs pI in Protein Folding",
col = 0) #this chunk of code is doing the main plotting
text(vol ~ isoelec,
labels = aa,
data = aa_dat,
col = 1:20) #this chunk of code adds the labels
## Figure 2: HA Figure 2.8 with ggpubr
ggscatter(y = "vol",
x = "isoelec",
size = "H2Opho.34",
color = "H2Opho.35",
data = aa_dat,
xlab = "pI",
ylab = "Volume")
## Figure 3: Highly Correlated Data Now, we will create a plot to analyze the relationship between two highly correlated variables: “SaaH2O” and “vol”.
ggscatter(y = "saaH2O",
x = "vol",
add = "reg.line", # regression line added
ellipse = TRUE, # ellipse included
cor.coef = TRUE, # correlation coefficient included
data = aa_dat,
xlab = "Volume from van der Waals radii",
ylab = "Surface area accessible to water in unfolded peptide")
## `geom_smooth()` using formula 'y ~ x'
Our data ellipse here works by encircling the underlying mean proportional to a given confidence interval, which itself is related to our calculated p-value. We also observe that our regression line appears to follow a linear relationship.
Here, we graph two variables whose relationship is nonlinear - that is, there is no straight line correlation between the two.
ggscatter(y = "pol",
x = "MW.da",
color = "isoelec",
size = "faal.fold",
add = "loess",
data = aa_dat,
xlab = "Polarity",
ylab = "Molecular weight in daltons")
## `geom_smooth()` using formula 'y ~ x'
First, we log-transform the two variables plotted on the x and y axes
aa_dat $ MW.da_log <- log(aa_dat[,"MW.da"])
aa_dat $ pol_log <- log(aa_dat[,"pol"])
Next, we create a log scale with these newly transformed variables
ggscatter(y = "pol_log",
x = "MW.da_log",
add = "reg.line",
color = "isoelec",
size = "faal.fold",
xlab = "ln(Molecular weight in daltons)",
ylab = "ln(Polarity)",
data = aa_dat)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
The log transformation takes the natural logarithm of each value in the given vector, allowing the regression line to appear more linear.
We observe in the scatterplot matrix that the variables “vol”, “MW.da”, and “saaH2O” appear to have the strongest correlation, all of which appear to be linear with r-values > 0.97. Graphing the 3D plot, we have the following.
scatterplot3d(x = aa_dat$vol,
y = aa_dat$MW.da,
z = aa_dat$saaH2O,
highlight.3d = TRUE,
type = "h")
Principal component analysis works by gradually reducing the dimensions of an n-dimensional data matrix while using algorithms to group similar data points closer to one another.
Higgs and Atwood make extensive use of this technique in their work, writing “[PCA functions as] a way of combining the information from all eight properties into a two-dimensional graph” (Higgs and Atwood 2005, pg 24).
In our class, we made PCA plots using the “vegan” package, as well as the basic-R commands.
Now we will create a version of our dataframe sans any categorical variables. To do so, I looked at all the variables in “aa_dat”, and included only the numerical ones in this new dataframe.
aa_dat2 = data.frame(vol, bulk, pol, isoelec, H2Opho.34, H2Opho.35, saaH2O, faal.fold)
pca <- prcomp(aa_dat2, scale = TRUE)
biplot(pca)
We begin by running the PCA using vegan, and calling our output “rda.out”.
rda.out <- vegan::rda(aa_dat2, scale = TRUE)
Next, we extract the PCA score.
rda_scores <- scores(rda.out)
From here, we create a PCA plot using the “biplot” function. We observe, that when compared to the PCA plot generated by Higgs and Atwood, that both the x and y-axis scales differ. However, the shape and location of the various groupings appear to resemble that of the original PCA plot.
biplot(rda.out, display = "sites", col = 0,)
orditorp(rda.out, display = "sites", labels = aa_dat$aa, cex = 1.2)
ordihull(rda.out,
group = aa_dat$charge,
col = 1:7,
lty = 1:7,
lwd = c(3,6))
Now we turn our focus to cluster analysis, which we have used in class to create phylogenetic trees from a given set of data. In this case, the clustering algorithm groups similar data points together.
One type of aforementioned clustering algorithms is UPGMA, known as a hierarchical agglomerative method, which means that data points begin in their own separate clusters before being joined to one another. The distances between all pairs of data points in two clusters are averaged, and the clusters with the shortest average distances are joined together. This process is repeated indefinitely until no cluster is unconnected.
row.names(aa_dat2) = paste(aa_dat$aa, sep = ".")
dist_euc <- dist(aa_dat2,
method = "euclidean")
clust_euc <- hclust(dist_euc,
method = "average") #average refers to "average linkage method", another name for UPGMA
dendro_euc <- as.dendrogram(clust_euc)
plot(dendro_euc,
horiz = T,
nodePar = list(pch = c(1,NA),
cex = 0.5,
lab.cex = 0.5))
My cluster diagram appears similar, at first glance, to the one created by Higgs and Atwood. The main differences appear to be the inclusion of a heat map to correlate differences between species, as well as minor discrepancies in the number of branches present (mine has two, theirs has three). This is likely due to different clustering algorithms being used.