This script replicates the Higgs and Attwood’s analysis on the properties of amino acids. With the data from their paper, we look at the properties of the amino acids, compare their similarities, and their specific functions. In this script, we mainly used PCA, clustering analysis and other R function.
library(ggplot2) # It is a flexible package for elegant data visualization in R. One word to describe the relationship between ggplot2 and ggpubr is wrapper.
## Warning: package 'ggplot2' was built under R version 4.0.3
library(ggpubr) # makes ggplot2 easier.The ‘ggpubr’ package provides some easy-to-use functions for creating and customizing ‘ggplot2’- based publication ready plots.
## Warning: package 'ggpubr' was built under R version 4.0.3
library(vegan) # PCA has functions doing/plotting PCA
## Warning: package 'vegan' was built under R version 4.0.3
## Loading required package: permute
## Warning: package 'permute' was built under R version 4.0.3
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 4.0.3
## This is vegan 2.5-6
library(scatterplot3d) #make a 3d scatterplot
## Warning: package 'scatterplot3d' was built under R version 4.0.3
library(permute) #used for cluster analysis
library(lattice) #used for cluster analysis
These data are derived mostly from Higgs (2009) Table 1. They are similar to Higgs and Attwood 2005 but have extra columns.
Make the vectors.
# 1 letter code
aa <-c('A','C','D','E','F','G','H','I','K','L','M','N',
'P','Q','R','S','T','V','W','Y')
## molecular weight in dalton
MW.da <-c(89,121,133,146,165,75,155,131,146,131,149,132,115,147,174,105,119,117,204,181)
## vol from van der Waals radii
vol <-c(67,86,91,109,135,48,118,124,135,124,124,96,90,
114,148,73,93,105,163,141)
## bulk – a measure of the shape of the side chain
bulk <-c(11.5,13.46,11.68,13.57,19.8,3.4,13.69,21.4,
15.71,21.4,16.25,12.28,17.43,
14.45,14.28,9.47,15.77,21.57,21.67,18.03)
## pol – a measure of the electric field strength around the molecule
pol <-c(0,1.48,49.7,49.9,0.35,0,51.6,0.13,49.5,0.13,
1.43,3.38,1.58,3.53,52,1.67,1.66,0.13,2.1,1.61)
## isoelec point
isoelec <-c(6,5.07,2.77,3.22,5.48,5.97,7.59,6.02,9.74,5.98,
5.74,5.41,6.3,5.65,10.76,5.68,6.16,5.96,5.89,5.66)
## 1st Hydrophobicity scale
H2Opho.34 <-c(1.8,2.5,-3.5,-3.5,2.8,-0.4,-3.2,4.5,-3.9,3.8,1.9,
-3.5,-1.6,-3.5,-4.5,-0.8,-0.7,4.2,-0.9,-1.3)
## 2nd Hydrophobicity scale
H2Opho.35 <-c(1.6,2,-9.2,-8.2,3.7,1,-3,3.1,-8.8,2.8,3.4,-4.8,
-0.2,-4.1,-12.3,0.6,1.2,2.6,1.9,-0.7)
## Surface area accessible to water in an unfolded peptide
saaH2O <-c(113,140,151,183,218,85,194,182,211,180,204,158,
143,189,241,122,146,160,259,229)
## Fraction of accessible area lost when a protein folds
faal.fold <-c(0.74,0.91,0.62,0.62,0.88,0.72,0.78,0.88,0.52,
0.85,0.85,0.63,0.64,0.62,0.64,0.66,0.7,0.86,0.85,0.76)
# Polar requirement
polar.req <-c(7,4.8,13,12.5,5,7.9,8.4,4.9,10.1,4.9,5.3,10,
6.6,8.6,9.1,7.5,6.6,5.6,5.2,5.4)
## relative frequency of occurance
## "The frequencies column shows the mean
### percentage of each amino acid in the protein sequences
### of modern organisms"
freq <-c(7.8,1.1,5.19,6.72,4.39,6.77,2.03,6.95,6.32,
10.15,2.28,4.37,4.26,3.45,5.23,6.46,5.12,7.01,1.09,3.3)
##
# charges
## un = Un-charged
## neg = negative
## pos = positive
charge<-c('un','un','neg','neg','un','un','pos','un','pos',
'un','un','un','un','un','pos','un','un','un','un','un')
## hydropathy
hydropathy<-c('hydrophobic','hydrophobic','hydrophilic','
hydrophilic','hydrophobic','neutral','neutral',
'hydrophobic','hydrophilic','hydrophobic','hydrophobic',
'hydrophilic','neutral',
'hydrophilic','hydrophilic','neutral','neutral',
'hydrophobic','hydrophobic','neutral')
## vol
vol.cat<-c('verysmall','small','small','medium',
'verylarge','verysmall','medium','large','large',
'large','large','small','small','medium','large',
'verysmall',
'small','medium','verylarge','verylarge')
## pol
pol.cat<-c('nonpolar','nonpolar','polar','polar',
'nonpolar','nonpolar','polar','nonpolar',
'polar','nonpolar','nonpolar','polar','nonpolar','polar',
'polar','polar','polar','nonpolar','nonpolar','polar')
## chemical
chemical<-c('aliphatic','sulfur','acidic','acidic','aromatic',
'aliphatic','basic','aliphatic','basic','aliphatic','sulfur',
'amide','aliphatic','amide','basic','hydroxyl','hydroxyl',
'aliphatic','aromatic','aromatic')
Build the dataframes.
### Build dataframe
aa_dat <- data.frame(aa,
MW.da,vol,
bulk, pol,
isoelec,H2Opho.34, H2Opho.35,
saaH2O, faal.fold, polar.req,
freq, charge, hydropathy,
vol.cat, pol.cat, chemical)
aa_dat
## aa MW.da vol bulk pol isoelec H2Opho.34 H2Opho.35 saaH2O faal.fold
## 1 A 89 67 11.50 0.00 6.00 1.8 1.6 113 0.74
## 2 C 121 86 13.46 1.48 5.07 2.5 2.0 140 0.91
## 3 D 133 91 11.68 49.70 2.77 -3.5 -9.2 151 0.62
## 4 E 146 109 13.57 49.90 3.22 -3.5 -8.2 183 0.62
## 5 F 165 135 19.80 0.35 5.48 2.8 3.7 218 0.88
## 6 G 75 48 3.40 0.00 5.97 -0.4 1.0 85 0.72
## 7 H 155 118 13.69 51.60 7.59 -3.2 -3.0 194 0.78
## 8 I 131 124 21.40 0.13 6.02 4.5 3.1 182 0.88
## 9 K 146 135 15.71 49.50 9.74 -3.9 -8.8 211 0.52
## 10 L 131 124 21.40 0.13 5.98 3.8 2.8 180 0.85
## 11 M 149 124 16.25 1.43 5.74 1.9 3.4 204 0.85
## 12 N 132 96 12.28 3.38 5.41 -3.5 -4.8 158 0.63
## 13 P 115 90 17.43 1.58 6.30 -1.6 -0.2 143 0.64
## 14 Q 147 114 14.45 3.53 5.65 -3.5 -4.1 189 0.62
## 15 R 174 148 14.28 52.00 10.76 -4.5 -12.3 241 0.64
## 16 S 105 73 9.47 1.67 5.68 -0.8 0.6 122 0.66
## 17 T 119 93 15.77 1.66 6.16 -0.7 1.2 146 0.70
## 18 V 117 105 21.57 0.13 5.96 4.2 2.6 160 0.86
## 19 W 204 163 21.67 2.10 5.89 -0.9 1.9 259 0.85
## 20 Y 181 141 18.03 1.61 5.66 -1.3 -0.7 229 0.76
## polar.req freq charge hydropathy vol.cat pol.cat
## 1 7.0 7.80 un hydrophobic verysmall nonpolar
## 2 4.8 1.10 un hydrophobic small nonpolar
## 3 13.0 5.19 neg hydrophilic small polar
## 4 12.5 6.72 neg \n hydrophilic medium polar
## 5 5.0 4.39 un hydrophobic verylarge nonpolar
## 6 7.9 6.77 un neutral verysmall nonpolar
## 7 8.4 2.03 pos neutral medium polar
## 8 4.9 6.95 un hydrophobic large nonpolar
## 9 10.1 6.32 pos hydrophilic large polar
## 10 4.9 10.15 un hydrophobic large nonpolar
## 11 5.3 2.28 un hydrophobic large nonpolar
## 12 10.0 4.37 un hydrophilic small polar
## 13 6.6 4.26 un neutral small nonpolar
## 14 8.6 3.45 un hydrophilic medium polar
## 15 9.1 5.23 pos hydrophilic large polar
## 16 7.5 6.46 un neutral verysmall polar
## 17 6.6 5.12 un neutral small polar
## 18 5.6 7.01 un hydrophobic medium nonpolar
## 19 5.2 1.09 un hydrophobic verylarge nonpolar
## 20 5.4 3.30 un neutral verylarge polar
## chemical
## 1 aliphatic
## 2 sulfur
## 3 acidic
## 4 acidic
## 5 aromatic
## 6 aliphatic
## 7 basic
## 8 aliphatic
## 9 basic
## 10 aliphatic
## 11 sulfur
## 12 amide
## 13 aliphatic
## 14 amide
## 15 basic
## 16 hydroxyl
## 17 hydroxyl
## 18 aliphatic
## 19 aromatic
## 20 aromatic
This code creates a correlation matrix. It uses the function to find out which variable has the most positive or negative correlation. saaH20 has the most postive correlation. Polar.req has the most negative correlation.
cor_ <- round(cor(aa_dat[,-c(1,13:17)]),2)
diag(cor_) <- NA
cor_[upper.tri(cor_)] <- NA
cor_
## MW.da vol bulk pol isoelec H2Opho.34 H2Opho.35 saaH2O faal.fold
## MW.da NA NA NA NA NA NA NA NA NA
## vol 0.93 NA NA NA NA NA NA NA NA
## bulk 0.55 0.73 NA NA NA NA NA NA NA
## pol 0.29 0.24 -0.20 NA NA NA NA NA NA
## isoelec 0.20 0.36 0.08 0.27 NA NA NA NA NA
## H2Opho.34 -0.27 -0.08 0.44 -0.67 -0.20 NA NA NA NA
## H2Opho.35 -0.25 -0.16 0.32 -0.85 -0.26 0.85 NA NA NA
## saaH2O 0.97 0.99 0.64 0.29 0.35 -0.18 -0.23 NA NA
## faal.fold 0.11 0.18 0.49 -0.53 -0.18 0.84 0.79 0.12 NA
## polar.req -0.05 -0.19 -0.53 0.76 -0.11 -0.79 -0.87 -0.11 -0.81
## freq -0.52 -0.30 -0.04 -0.01 0.02 0.26 -0.02 -0.38 -0.18
## polar.req freq
## MW.da NA NA
## vol NA NA
## bulk NA NA
## pol NA NA
## isoelec NA NA
## H2Opho.34 NA NA
## H2Opho.35 NA NA
## saaH2O NA NA
## faal.fold NA NA
## polar.req NA NA
## freq 0.14 NA
which(cor_ == max(cor_, na.rm = T), arr.ind = T) #to find the most positive correlation
## row col
## saaH2O 8 2
which(cor_ == min(cor_, na.rm = T), arr.ind = T) #to find the most negative correlation
## row col
## polar.req 10 7
This code set up the function to plot the scatter matrix. And it also has the lines of best fit to allow us to see the relationship between the variables.
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor,...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}
aa_dat_new <- data.frame(MW.da,vol,pol,H2Opho.34, polar.req,saaH2O, faal.fold)
plot(aa_dat_new, upper.panel = panel.cor,
panel = panel.smooth)
From the graph, we can know that volume and saaH2O has the s
the relationships between several variables are strong: saaH2O/volume, saaH2O/MW.da,and so forth. the relationships between several variables are weak: saaH2O/pol, pol/Mw.da and so forth. the relationship between variables is non-linear: Pol/H2Opo.34.
This plot is the replicated copy of the Figure 2.8 from the Higgs and Atwood paper showing the relationship between amino acid volume and pI.
plot(polar.req ~ freq, data = aa_dat, #draw data from the aa_data to make the plot
xlab = "pI", #label the x axis
ylab = "Volume(Å3)", #label the y axis
main = "Plot of amino acid volume against pI", #name the title for the plot
col = 0)
text(polar.req ~ freq, #main plotting
labels = aa,
data = aa_dat,
col = 1:20)
This plot shows the relationship between frequency and polar requirement. Since the graph uses various sizes to demonstrate the bulkiness, and different color to denote molecular weight, we are able to see 4 variables in one plot.
ggscatter(y = "polar.req",
x = "freq",
size = "polar.req",
color = "polar.req",
data = aa_dat,
xlab = "Frequency",
ylab = "Polar Requirement")
This plot shows the non-linear relationship between polarity and molecular weight.
ggscatter(y = "pol",
x = "MW.da",
size = "vol",
add = "loess", # add a loess smoother
cor.coef = TRUE, # correlation coef
color = "faal.fold",
data = aa_dat,
xlab = "Molecular Weight in Dalton",
ylab = " Polarity")
## `geom_smooth()` using formula 'y ~ x'
The log scale transform the plot to its linear form by plotting the original data’s after performing the log function. After the log-transformation, we can get a a linear regression line in spite of the non-linear relationship.
aa_dat$pol_log <- log(aa_dat[,"pol"])
aa_dat$MW.da_log <- log(aa_dat[,"MW.da"])
ggscatter(y = "pol_log",
x = "MW.da_log",
size = "vol",
add = "reg.line", # line of best fit
color = "faal.fold",
data = aa_dat,
xlab = "Molecular Weight in Dalton",
ylab = "Polarity")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
The code below will create a 3D scatter plot that demonstrates the relationship between MW.da, volume, and saaH2O. These three variables have relatively strong correlations with one another. It means, as x(Mw.da) increases, y(volume) and z(saaH2O) also increase.
To interprete the correlation in a 3D scatterplot, we can look at each plane seperately. To find out the relationship between x(MW.da) and y(vol), we look at the bottom plane. To find out the relationship between x(MW.da) and z(saaH2O),we looking the plane where it forms a diagonal line.
scatterplot3d(x = aa_dat$MW.da,
y = aa_dat$vol,
z = aa_dat$saaH2O,
xlab = "MW.da",
ylab = "vol",
zlab = "saaH2O",
highlight.3d = TRUE,
type = "h")
Higgs and Attwood define Principal component analysis (PCA) as “a way of visualizing the important features of multidimensional data sets, such as tables of amino acid properties”(2005, pg 34).PCA is a statistical technique used for reducing the dimmentionality of a large dataset. It is capable of recalculating coordinates for the data, then essentially flattening the data down. In biology or ecology, we usually use PCA for phylogenetic analysis.
A new dataframe, aa_dat2, is created with 8 variables. THose 8 variables are the ones in the Table 2.2.
aa_dat2 <- data.frame(vol,bulk,pol, isoelec, H2Opho.34, H2Opho.35, saaH2O,faal.fold)
This code is for creating a PCA plot by a base R command and plotting the output.
pca.out <- prcomp(aa_dat2, scale = TRUE) #Run the PCA
biplot(pca.out) #Plot the output
This code is for running the PCA using vegan and call the output rda.out.
rda.out <- vegan::rda(aa_dat2,
scale = TRUE) # to properly scale numeric data by using their respective standard deviation
rda_scores <- scores(rda.out) #extract the PCA score and name it as rda_scores
rda_scores
## $species
## PC1 PC2
## vol 0.1292455 -1.21867402
## bulk -0.5240422 -1.00021659
## pol 1.0392368 -0.22161142
## isoelec 0.4272346 -0.51780748
## H2Opho.34 -1.1576058 -0.06083202
## H2Opho.35 -1.1909235 0.07563303
## saaH2O 0.2293130 -1.17930581
## faal.fold -1.0664891 -0.35159129
##
## $sites
## PC1 PC2
## sit1 -0.431697708 1.01348876
## sit2 -0.812161280 0.50497652
## sit3 0.958626669 0.71357979
## sit4 0.941582653 0.22249037
## sit5 -0.867271594 -0.76051369
## sit6 -0.093468837 1.79557675
## sit7 0.643286377 -0.33344292
## sit8 -1.013877712 -0.56001005
## sit9 1.450808627 -0.67736662
## sit10 -0.904499352 -0.52173690
## sit11 -0.668686375 -0.38778285
## sit12 0.462264010 0.53066191
## sit13 0.007416261 0.36866189
## sit14 0.458970843 0.05206945
## sit15 1.580353509 -1.08451465
## sit16 -0.014704047 1.08414678
## sit17 -0.184996200 0.37158066
## sit18 -0.979015329 -0.23223164
## sit19 -0.420924604 -1.37484059
## sit20 -0.112005910 -0.72479295
##
## attr(,"const")
## [1] 3.511243
This code is for creating a biplot.
biplot(rda.out, display = "sites", col = 0,)
orditorp(rda.out, display = "sites", labels = aa_dat$aa, cex = 1.2)
ordihull(rda.out,
group = aa_dat$charge,
col = 1:7,
lty = 1:7,
lwd = c(3,6))
The plot created above is very different from the one in the in the Higgs and Atwood paper. 1. the scale for these two plots are different 2. the data points in the paper are more spread out comparing to the plot created above. 3. the orientation of the figure in the paper is the upside-down version of the plot created above. 4. in the original paper, there is no grouping indicated in the figure. But the plot created above make use of red, green and black lines to show the grouping to show the relationship.
The purpose of cluster analysis is classify different objects into similar groups.To categorize them into cluster, we base on the similarities or dissimilarities between the objects.
Unweighted pair-group method with arithmetic mean (UPGMA) is a type of clustering method that utilizes the hierarchical clustering method. Since UPGMA is a distance method, it needs a distance matrix. And it can be used for generating visual plots such as dendrograms.
row.names(aa_dat2) <- paste(aa_dat$aa, sep = ".")
dist_euc <- dist(aa_dat2,
method = "euclidean")
clust_euc<- hclust(dist_euc,
method = "average") #this method sets hclust to UPGMA
dendro_euc = as.dendrogram(clust_euc)
plot(dendro_euc,horiz = T,
nodePar = list(pch = c(1,NA),
cex = 0.5,
lab.cex = 1.2))
This plot is a distance matrix to show distance between objects: amino acids. The distance is calculated by Euclidean distance matrix based on the different properties.An Euclidean distance matrix is a matrix containing squared distances between two points. As an equation, it can be expressed as a sum of squares.
This cluster analysis has some differences in the comparison to the one in the Higgs and Atwood paper.The difference is the relative proximity between amino acids. The distance between some amino acids gets changed: some are closer and some are further.