#Introduction
This script is used to show an analysis of amino acids and their different properties. It is important because by understanding the different ways amino acids are similar and different contributes to their function. This script uses cluster analysis, PCA and other analysis techniques to find similarities and differences between different amino acids. This analysis offers insight to many aspects of amino acids and their functions.
#Preliminaries
##Packages
Ggpubr is a wrapper of ggplot2. This allows the creation of more complex and detailed graphs.
#install.packages("permute")
#install.packages("lattice")
#install.packages("ggpubr")
#install.packages("scatterplot3d")
#install.packages("vegan")
library(permute) #used in cluster analysis
library(lattice) #used in cluster analysis
library(ggpubr) #graphing tool
## Loading required package: ggplot2
library(scatterplot3d) #used to make a 3d scatterplot
library(vegan) #used in cluster analysis
## This is vegan 2.5-6
##Build the dataframe
This data is taken from the Higgs and Atwood paper. It is an array of different characteristics of each amino acid. All of these characteristics contribute to the amino acids function.
# 1 letter code
aa <-c('A','C','D','E','F','G','H','I','K','L','M','N',
'P','Q','R','S','T','V','W','Y')
# molecular weight in the dalton
MW.da <-c(89,121,133,146,165,75,155,131,146,131,149,132,115,147,174,105,119,117,204,181)
## vol from van der Waals radii
vol <-c(67,86,91,109,135,48,118,124,135,124,124,96,90,
114,148,73,93,105,163,141)
## bulk – a measure of the shape of the side chain
bulk <-c(11.5,13.46,11.68,13.57,19.8,3.4,13.69,21.4,
15.71,21.4,16.25,12.28,17.43,
14.45,14.28,9.47,15.77,21.57,21.67,18.03)
## pol – a measure of the electric field strength around the molecule
pol <-c(0,1.48,49.7,49.9,0.35,0,51.6,0.13,49.5,0.13,
1.43,3.38,1.58,3.53,52,1.67,1.66,0.13,2.1,1.61)
## isoelec point
isoelec <-c(6,5.07,2.77,3.22,5.48,5.97,7.59,6.02,9.74,5.98,
5.74,5.41,6.3,5.65,10.76,5.68,6.16,5.96,5.89,5.66)
## 1st Hydrophobicity scale
H2Opho.34 <-c(1.8,2.5,-3.5,-3.5,2.8,-0.4,-3.2,4.5,-3.9,3.8,1.9,
-3.5,-1.6,-3.5,-4.5,-0.8,-0.7,4.2,-0.9,-1.3)
## 2nd Hydrophobicity scale
H2Opho.35 <-c(1.6,2,-9.2,-8.2,3.7,1,-3,3.1,-8.8,2.8,3.4,-4.8,
-0.2,-4.1,-12.3,0.6,1.2,2.6,1.9,-0.7)
## Surface area accessible to water in an unfolded peptide
saaH2O <-c(113,140,151,183,218,85,194,182,211,180,204,158,
143,189,241,122,146,160,259,229)
## Fraction of accessible area lost when a protein folds
faal.fold <-c(0.74,0.91,0.62,0.62,0.88,0.72,0.78,0.88,0.52,
0.85,0.85,0.63,0.64,0.62,0.64,0.66,0.7,0.86,0.85,0.76)
# Polar requirement
polar.req <-c(7,4.8,13,12.5,5,7.9,8.4,4.9,10.1,4.9,5.3,10,
6.6,8.6,9.1,7.5,6.6,5.6,5.2,5.4)
## relative frequency of occurance
## "The frequencies column shows the mean
### percentage of each amino acid in the protein sequences
### of modern organisms"
freq <-c(7.8,1.1,5.19,6.72,4.39,6.77,2.03,6.95,6.32,
10.15,2.28,4.37,4.26,3.45,5.23,6.46,5.12,7.01,1.09,3.3)
##
# charges
## un = Un-charged
## neg = negative
## pos = positive
charge<-c('un','un','neg','neg','un','un','pos','un','pos','un','un','un','un','un','pos','un','un','un','un','un')
## hydropathy
hydropathy<-c('hydrophobic','hydrophobic','hydrophilic','
hydrophilic','hydrophobic','neutral','neutral',
'hydrophobic','hydrophilic','hydrophobic','hydrophobic',
'hydrophilic','neutral',
'hydrophilic','hydrophilic','neutral','neutral',
'hydrophobic','hydrophobic','neutral')
## vol
vol.cat<-c('verysmall','small','small','medium',
'verylarge','verysmall','medium','large','large',
'large','large','small','small','medium','large',
'verysmall',
'small','medium','verylarge','verylarge')
## pol
pol.cat<-c('nonpolar','nonpolar','polar','polar',
'nonpolar','nonpolar','polar','nonpolar',
'polar','nonpolar','nonpolar','polar','nonpolar','polar',
'polar','polar','polar','nonpolar','nonpolar','polar')
## chemical
chemical<-c('aliphatic','sulfur','acidic','acidic','aromatic',
'aliphatic','basic','aliphatic','basic','aliphatic','sulfur',
'amide','aliphatic','amide','basic','hydroxyl','hydroxyl',
'aliphatic','aromatic','aromatic')
log_vol = log(vol)
log_polar_req = log(polar.req)
aa_dat = data.frame(aa, MW.da, vol, bulk, pol, isoelec, H2Opho.34, H2Opho.35, saaH2O, faal.fold, polar.req, freq, charge, hydropathy, vol.cat, pol.cat, chemical, log_vol, log_polar_req)
#Raw data exploration
##Matrix
cor_ <- round(cor(aa_dat[,-c(1,13:17)]),2)
diag(cor_) <- NA
cor_[upper.tri(cor_)] <- NA
cor_
## MW.da vol bulk pol isoelec H2Opho.34 H2Opho.35 saaH2O
## MW.da NA NA NA NA NA NA NA NA
## vol 0.93 NA NA NA NA NA NA NA
## bulk 0.55 0.73 NA NA NA NA NA NA
## pol 0.29 0.24 -0.20 NA NA NA NA NA
## isoelec 0.20 0.36 0.08 0.27 NA NA NA NA
## H2Opho.34 -0.27 -0.08 0.44 -0.67 -0.20 NA NA NA
## H2Opho.35 -0.25 -0.16 0.32 -0.85 -0.26 0.85 NA NA
## saaH2O 0.97 0.99 0.64 0.29 0.35 -0.18 -0.23 NA
## faal.fold 0.11 0.18 0.49 -0.53 -0.18 0.84 0.79 0.12
## polar.req -0.05 -0.19 -0.53 0.76 -0.11 -0.79 -0.87 -0.11
## freq -0.52 -0.30 -0.04 -0.01 0.02 0.26 -0.02 -0.38
## log_vol 0.91 0.98 0.77 0.25 0.31 -0.07 -0.17 0.96
## log_polar_req -0.09 -0.22 -0.58 0.74 -0.03 -0.82 -0.86 -0.15
## faal.fold polar.req freq log_vol log_polar_req
## MW.da NA NA NA NA NA
## vol NA NA NA NA NA
## bulk NA NA NA NA NA
## pol NA NA NA NA NA
## isoelec NA NA NA NA NA
## H2Opho.34 NA NA NA NA NA
## H2Opho.35 NA NA NA NA NA
## saaH2O NA NA NA NA NA
## faal.fold NA NA NA NA NA
## polar.req -0.81 NA NA NA NA
## freq -0.18 0.14 NA NA NA
## log_vol 0.17 -0.16 -0.29 NA NA
## log_polar_req -0.86 0.99 0.14 -0.2 NA
This code shows where the max correlation occurs. It occurs between saaH2O and vol. The max negative correlation occurs between hydrophobe.35 and polar.req. The min correlation occurs between MW.da and faa.fold as well as MW.da and polar.req.
which(cor_ == max(cor_, na.rm = T), arr.ind = T)
## row col
## saaH2O 8 2
## log_polar_req 13 10
#Plot 0: Scatterplot matrix
pairs(aa_dat[2:8])
The below code will add lines to the graphs. It will also show how much each of the variables relate to each other. It makes the scatterplot matrices much easier to interpret.
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}
Of the six variables we plotted MW.da and vol have a linear relationship. Pol and all other variables have essentially no relationship with two sections of values separated to either side of the graph. H2Opho.34 and polar.req have a somewhat linear relationship. The rest of the comparisons seem to have nonlinear relationships.
#Plot 1: Replication Higgs and Attwood Figure 2.8
This plot represents the correlation between the amino acid volume and the polar requirement.The main code constructs the plot comparing the two values. The rest of the plot gives a color to the different amino acids and the text labels tell us which each side represents. This allows for user readability for someone looking at the plot.
plot(polar.req ~ freq, data = aa_dat,
xlab = "Polar requirement",
ylab = "Volume (A^3)",
main = "Amino Acid vol vs. pI",
col = 0)
text(polar.req ~ freq,
labels = aa,
data = aa_dat,
col = 1:20)
#Figure 2: HA Figure 2.8 with ggpubr
ggscatter(y = "polar.req",
x = "freq",
size = "polar.req",
color = "polar.req",
data = aa_dat,
xlab = "polar requirement",
ylab = "Volume (A^3)")
#Figure 3: Highly correlated data
The regression line follows the equation y = b1x + b0.The ggpubr line add adds the line of regression and the line ellipse adds the ellipse and cor.coef adds the correlation coefficient. The data ellipse shows the spread of all the data. The smaller the ellipse the more closely related the data is.
ggscatter(y = "MW.da",
x = "vol",
size = "MW.da",
add = "reg.line", # line of best fit
ellipse = TRUE, # data ellipse
cor.coef = TRUE, # correlation coef
# color = "vol",
data = aa_dat,
xlab = "Volume",
ylab = "Molecular Weight")
## `geom_smooth()` using formula 'y ~ x'
#Figure 4: Non-linear relationship
ggscatter(data = aa_dat,
y = "vol",
x = "polar.req",
size = "MW.da",
color = "pol",
add = "loess") +
xlab("polar requirment") +
ylab("Volume")
## `geom_smooth()` using formula 'y ~ x'
#Figure 5: Non-linear relationship on LOG scale
This graph compares the correlation of pI and volume when it is transformed to the log scale. Based off the graphs these two values still do not correlate on the log scale.
ggscatter(data = aa_dat,
y = "log_vol",
x = "log_polar_req",
size = "bulk",
color = "freq",
add = "reg.line") +
xlab("log(pI)") +
ylab("log(Volume)")
## `geom_smooth()` using formula 'y ~ x'
#Figure 6: 3D Scatterplot
There does not seem to be too much correlation for these 3 variables overall. A lot of the values seemed to be placed in a mostly nonlinear manner. All of the data also seems to be distributed throughout the plot there is not specific clusters of data.
scatterplot3d(x = aa_dat$polar.req,
y = aa_dat$MW.da,
z = aa_dat$svol,
highlight.3d = TRUE,
angle = 60,
type="h")
#Principal component analysis
PCA is often used to find the correlation between multiple variables in a data set. It shows how each of the variables relate to each other in one graph to see if there are large points of correlation between multiple variables. We have used this in class to show correlation between different taxa to try to derive phylogenetic relation between species.
“It would be useful to plot some kind of diagram that lets us visualize the information in this table. It is straightforward to take any two of the properties and use these as the coordinates for the points in a two-dimensional graph”s (Higgs and Attwood 2005, pg 25).
We made PCA many ways in this class. One way was through the use of distance matrices. This allowed us to find the correlation between multiple variables and then compare them. We also made PCA through the use of R and modelling on the computer.
##Data Prep
I used table 2.2 from the article to create this variable. It is made of only the numerical values of the data used.
aa_dat2 = data.frame(vol, bulk, pol, isoelec,H2Opho.34, H2Opho.35, saaH2O, faal.fold)
##Figure 7: Principal components analysis - base R
pca.out <- prcomp(aa_dat2, scale = TRUE)
biplot(pca.out)
##Figure 8: Principal components analysis - vegan
The difference between the plot above and this plot is for one this plot provides a much easier reader experience. It groups the variables based off their correlation to charge. This resulted in 3 groups because charge has 3 variables (pos, neg, and un). The graph above compares all of the numerical values and their correlation to each other.
rda.out <- vegan::rda(aa_dat2,
scale = TRUE)
biplot(rda.out, display = "sites", col = 0,)
orditorp(rda.out, display = "sites", labels = aa_dat$aa, cex = 1.2)
ordihull(rda.out,
group = aa_dat$charge,
col = 1:7,
lty = 1:7,
lwd = c(3,6))
#Cluster Analysis
UPGMA creates a rooted dendrogram. This dendrogram can show the relation of different variables to others. This has been used for many different purposes including evolutionary closeness, behavior, and comparison of amino acids. They are very useful for showing correlation between variables.
##Hierarchal Cluster Analysis
A Distance matrix is used to find the relational distance of different variables. The variables clustered together are clustered together and then compared to remaining variables.
Euclidean distance is the distance between to points or two variables. It results in a line segment length between the points of the distance matrix.
# make smaller dataframe
aa_dat2 = aa_dat[,-c(1,
2,11,12,
13,14,15,16,17)]
#set row names
row.names(aa_dat2) = paste(aa_dat$aa,
sep = ".")
# do dist matrix stuff
dist_euc = dist(aa_dat2,
method = "euclidean")
clust_euc = hclust(dist_euc,
method = "average") #average sets it to use UPGMA
dendro_euc = as.dendrogram(clust_euc)
plot(dendro_euc,horiz = T,
nodePar = list(pch = c(1,NA),
cex = 0.5,
lab.cex = 1.2))
There are some differences between this dendrogram and the one in the Higgs and Atwood paper. For example their paper clusters K and R as sister species. Essentially the entire first clade is moved around a bit H and E are also shifted. There is also differences such as P and C are very closely connected in our tree but not as much in the Higgs and Atwood tree. Overall some of the amino acids are just shifted around in each tree which results in different relative closeness between amino acids.