An analysis of a set of amino acids is done by this script. The script uses R functions and packages to analyze the properties and similarities of amino acids. ## Preliminaries ### Packages 1. ggplot2 create elegant data visualisations using the grammar of graphics 2. ggpubr General Arguments Description 3. scatterplot3d: Plots a three dimensional point cloud. 4. vegan:the vegan package provides tools for descriptive community ecology.
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.5-6
GGpubr makes ggplot2 easier allowing the user to create more complex and compounded graphs.
library(permute)
library(lattice)
These data are derived mostly from Higgs (2009) Table 1. They are similar to Higgs and Attwood 2005 but have extra columns. Nathan Brouwer added additional categorical varibles based on online information.
Make the vectors.
# 1 letter code
aa <-c('A','C','D','E','F','G','H','I','K','L','M','N','P','Q','R','S','T','V','W','Y')
# molecular weight in dalton
MW.da <-c(89,121,133,146,165,75,155,131,146,131,149,132,115,147,174,105,119,117,204,181)
# vol from van der Waals radii
volume <-c(67,86,91,109,135,48,118,124,135,124,124,96,90,
114,148,73,93,105,163,141)
# bulk – a measure of the shape of the side chain
bulkiness <-c(11.5,13.46,11.68,13.57,19.8,3.4,13.69,21.4,
15.71,21.4,16.25,12.28,17.43,
14.45,14.28,9.47,15.77,21.57,21.67,18.03)
## pol – a measure of the electric field strength around the molecule
polarity <-c(0,1.48,49.7,49.9,0.35,0,51.6,0.13,49.5,0.13,
1.43,3.38,1.58,3.53,52,1.67,1.66,0.13,2.1,1.61)
# isoelec point
isoelectric.pt <-c(6,5.07,2.77,3.22,5.48,5.97,7.59,6.02,9.74,5.98,
5.74,5.41,6.3,5.65,10.76,5.68,6.16,5.96,5.89,5.66)
# 1st Hydrophobicity scale
hydrophobe.34 <-c(1.8,2.5,-3.5,-3.5,2.8,-0.4,-3.2,4.5,-3.9,3.8,1.9,
-3.5,-1.6,-3.5,-4.5,-0.8,-0.7,4.2,-0.9,-1.3)
# 2nd Hydrophobicity scale
hydrophobe.35 <-c(1.6,2,-9.2,-8.2,3.7,1,-3,3.1,-8.8,2.8,3.4,-4.8,
-0.2,-4.1,-12.3,0.6,1.2,2.6,1.9,-0.7)
# Surface area accessible to water in an unfolded peptide
saaH2O <-c(113,140,151,183,218,85,194,182,211,180,204,158,
143,189,241,122,146,160,259,229)
# Fraction of accessible area lost when a protein folds
faal.fold <-c(0.74,0.91,0.62,0.62,0.88,0.72,0.78,0.88,0.52,
0.85,0.85,0.63,0.64,0.62,0.64,0.66,0.7,0.86,0.85,0.76)
# Polar requirement
polar.req <-c(7,4.8,13,12.5,5,7.9,8.4,4.9,10.1,4.9,5.3,10,
6.6,8.6,9.1,7.5,6.6,5.6,5.2,5.4)
## relative frequency of occurrance
## "The frequencies column shows the mean
### percentage of each amino acid in the protein sequences
### of modern organisms"
freq <-c(7.8,1.1,5.19,6.72,4.39,6.77,2.03,6.95,6.32,
10.15,2.28,4.37,4.26,3.45,5.23,6.46,5.12,7.01,1.09,3.3)
# charges
## un = Un-charged
## neg = negative
## pos = positive
charge<-c('un','un','neg','neg','un','un','pos','un','pos',
'un','un','un','un','un','pos','un','un','un','un','un')
# hydropathy
hydropathy<-c('hydrophobic','hydrophobic','hydrophilic','
hydrophilic','hydrophobic','neutral','neutral','hydrophobic','hydrophilic','hydrophobic','hydrophobic','hydrophilic',' neutral',
'hydrophilic','hydrophilic','neutral','neutral',
'hydrophobic','hydrophobic','neutral')
# vol
volume.cat<-c('verysmall','small','small','medium',
'verylarge','verysmall','medium','large','large',
'large','large','small','small','medium','large','verysmall','small','medium','verylarge','verylarge')
# pol
polarity.cat<-c('nonpolar','nonpolar','polar','polar',
'nonpolar','nonpolar','polar','nonpolar',
'polar','nonpolar','nonpolar','polar','nonpolar','polar',
'polar','polar','polar','nonpolar','nonpolar','polar')
# chemical
chemical<-c('aliphatic','sulfur','acidic','acidic','aromatic',
'aliphatic','basic','aliphatic','basic','aliphatic','sulfur',
'amide','aliphatic','amide','basic','hydroxyl','hydroxyl',
'aliphatic','aromatic','aromatic')
Build the dataframes.
### Build dataframe
aa_dat <- data.frame(aa,MW.da,volume,bulkiness,
polarity,isoelectric.pt,hydrophobe.34,hydrophobe.35,
saaH2O,faal.fold, polar.req,freq,charge,hydropathy,volume.cat,polarity.cat,chemical)
aa_dat
## aa MW.da volume bulkiness polarity isoelectric.pt hydrophobe.34
## 1 A 89 67 11.50 0.00 6.00 1.8
## 2 C 121 86 13.46 1.48 5.07 2.5
## 3 D 133 91 11.68 49.70 2.77 -3.5
## 4 E 146 109 13.57 49.90 3.22 -3.5
## 5 F 165 135 19.80 0.35 5.48 2.8
## 6 G 75 48 3.40 0.00 5.97 -0.4
## 7 H 155 118 13.69 51.60 7.59 -3.2
## 8 I 131 124 21.40 0.13 6.02 4.5
## 9 K 146 135 15.71 49.50 9.74 -3.9
## 10 L 131 124 21.40 0.13 5.98 3.8
## 11 M 149 124 16.25 1.43 5.74 1.9
## 12 N 132 96 12.28 3.38 5.41 -3.5
## 13 P 115 90 17.43 1.58 6.30 -1.6
## 14 Q 147 114 14.45 3.53 5.65 -3.5
## 15 R 174 148 14.28 52.00 10.76 -4.5
## 16 S 105 73 9.47 1.67 5.68 -0.8
## 17 T 119 93 15.77 1.66 6.16 -0.7
## 18 V 117 105 21.57 0.13 5.96 4.2
## 19 W 204 163 21.67 2.10 5.89 -0.9
## 20 Y 181 141 18.03 1.61 5.66 -1.3
## hydrophobe.35 saaH2O faal.fold polar.req freq charge
## 1 1.6 113 0.74 7.0 7.80 un
## 2 2.0 140 0.91 4.8 1.10 un
## 3 -9.2 151 0.62 13.0 5.19 neg
## 4 -8.2 183 0.62 12.5 6.72 neg
## 5 3.7 218 0.88 5.0 4.39 un
## 6 1.0 85 0.72 7.9 6.77 un
## 7 -3.0 194 0.78 8.4 2.03 pos
## 8 3.1 182 0.88 4.9 6.95 un
## 9 -8.8 211 0.52 10.1 6.32 pos
## 10 2.8 180 0.85 4.9 10.15 un
## 11 3.4 204 0.85 5.3 2.28 un
## 12 -4.8 158 0.63 10.0 4.37 un
## 13 -0.2 143 0.64 6.6 4.26 un
## 14 -4.1 189 0.62 8.6 3.45 un
## 15 -12.3 241 0.64 9.1 5.23 pos
## 16 0.6 122 0.66 7.5 6.46 un
## 17 1.2 146 0.70 6.6 5.12 un
## 18 2.6 160 0.86 5.6 7.01 un
## 19 1.9 259 0.85 5.2 1.09 un
## 20 -0.7 229 0.76 5.4 3.30 un
## hydropathy volume.cat polarity.cat chemical
## 1 hydrophobic verysmall nonpolar aliphatic
## 2 hydrophobic small nonpolar sulfur
## 3 hydrophilic small polar acidic
## 4 \n hydrophilic medium polar acidic
## 5 hydrophobic verylarge nonpolar aromatic
## 6 neutral verysmall nonpolar aliphatic
## 7 neutral medium polar basic
## 8 hydrophobic large nonpolar aliphatic
## 9 hydrophilic large polar basic
## 10 hydrophobic large nonpolar aliphatic
## 11 hydrophobic large nonpolar sulfur
## 12 hydrophilic small polar amide
## 13 neutral small nonpolar aliphatic
## 14 hydrophilic medium polar amide
## 15 hydrophilic large polar basic
## 16 neutral verysmall polar hydroxyl
## 17 neutral small polar hydroxyl
## 18 hydrophobic medium nonpolar aliphatic
## 19 hydrophobic verylarge nonpolar aromatic
## 20 neutral verylarge polar aromatic
The code creates a correlation matrix
cor_ <- round(cor(aa_dat[,-c(1,13:17)]),2)
diag(cor_) <- NA
cor_[upper.tri(cor_)] <- NA
cor_
## MW.da volume bulkiness polarity isoelectric.pt hydrophobe.34
## MW.da NA NA NA NA NA NA
## volume 0.93 NA NA NA NA NA
## bulkiness 0.55 0.73 NA NA NA NA
## polarity 0.29 0.24 -0.20 NA NA NA
## isoelectric.pt 0.20 0.36 0.08 0.27 NA NA
## hydrophobe.34 -0.27 -0.08 0.44 -0.67 -0.20 NA
## hydrophobe.35 -0.25 -0.16 0.32 -0.85 -0.26 0.85
## saaH2O 0.97 0.99 0.64 0.29 0.35 -0.18
## faal.fold 0.11 0.18 0.49 -0.53 -0.18 0.84
## polar.req -0.05 -0.19 -0.53 0.76 -0.11 -0.79
## freq -0.52 -0.30 -0.04 -0.01 0.02 0.26
## hydrophobe.35 saaH2O faal.fold polar.req freq
## MW.da NA NA NA NA NA
## volume NA NA NA NA NA
## bulkiness NA NA NA NA NA
## polarity NA NA NA NA NA
## isoelectric.pt NA NA NA NA NA
## hydrophobe.34 NA NA NA NA NA
## hydrophobe.35 NA NA NA NA NA
## saaH2O -0.23 NA NA NA NA
## faal.fold 0.79 0.12 NA NA NA
## polar.req -0.87 -0.11 -0.81 NA NA
## freq -0.02 -0.38 -0.18 0.14 NA
In order to find where the max correlation occurs, the code below uses functions.The most positive correlation is between SaaH2O and volume, whereas polar.req and hydrophobe.35 have the most negative correlation.
which(cor_ == max(cor_, na.rm = T), arr.ind = T)
## row col
## saaH2O 8 2
panel.cor: Correlation coefficient panel for pairs function
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor,...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}
The code below plots the scatter plot matrices.
plot(~MW.da+volume+polarity+hydrophobe.34+polar.req+bulkiness,
data = na.omit(aa_dat[,c("MW.da","volume","polarity","hydrophobe.34","polar.req","bulkiness")]),
panel = panel.smooth,
upper.panel =panel.cor )
We are able to analyze and observe the relationships between the 6 variables from this information in the scatterplot matrix.There is a strong positive correlation if the correlation value is greater or equal to 0.7.Whereas, there is a strong negative correlation if the value is less than or equal to -0.7.
Now, we will replicate Figure 2.8 from the Higgs and Atwood paper showing the relationship between amino acid volume and pI.
plot(polar.req ~ freq, data = aa_dat,
xlab = "pI",
ylab = "Volume (Å3)",
main = "Amino Acid Vol vs. pI ",
col = 0)
text(polar.req ~ freq,
labels = aa,
data = aa_dat,
col = 1:20)
The plot represents and shows the relationship between amino acid volume and pI. From this plot we can deduce the amino acid and its property based on where it is located within the figure.
This graph compares the relationship and correlation between the polar requirement and frequency. As a result, the color and size of each point is at the hand of the molecular weight and volume.
ggscatter(y = "polar.req",
x = "freq",
size = "MW.da",
color = "volume",
data = aa_dat,
xlab = "Polar Requirement",
ylab = "Frequency")
We will now identify two variables that have a non-linear relationship, which are volume and polar requirement.
ggscatter(data = aa_dat,
y = "volume",
x = "polar.req",
size = "MW.da",
color = "polarity",
add = "loess") +
xlab("polar requirment") +
ylab("volume")
## `geom_smooth()` using formula 'y ~ x'
##
geom_smooth() using formula ‘y ~ x’
We will want to re-scale the data using the natural log to make it more linear. To do so, we will need to assign new values in the data frame that entail the log versions of volume and polar.req.
aa_dat$volume_log <- log(aa_dat[,"volume"])
aa_dat$polar.req_log <- log(aa_dat[,"polar.req"])
The code below will create the log-scale scatter plot:
aa_dat$volume_log <- log10(aa_dat$volume)
aa_dat$polar.req_log <- log10(aa_dat$polar.req)
ggscatter(y = "volume_log",
x = "polar.req_log",
add = "reg.line",
size = "MW.da",
color = "polarity",
xlab = "Log10 polar requirement",
ylab = "Log10 volume",
data = aa_dat)
## `geom_smooth()` using formula 'y ~ x'
##
geom_smooth() using formula ‘y ~ x’
The code below will create a 3D scatter plot that demonstrates the relationship / correlation between MW.da, volume, and saaH20. These three variables have relatively strong correlations with one another.
scatterplot3d(x = aa_dat$MW.da,
y = aa_dat$volume,
z = aa_dat$saaH2Og,
xlab = "aa_dat$MW.da",
highlight.3d = TRUE,
type = "h")
Using the 3D scatter plot that was created above the relationship between MW.da, volume, and saaH20 can be clearly seen and defined. They demonstrate relatively strong relationships with each other on this type of plane and we can see that by looking at the lines that are created in relation to the other variables.
Principal component analysis, or PCA, is a statistical technique that is widely used to reduce the dimensionality of very large data sets. PCA is able to recalculate coordinates to match the new set of axes - essentially flattening the data down (reduction). It spreads the data out and can color code various variables and can take a 6D plot to a 2D one, with the greatest amount of variation.
Higgs and Atwood explain PCA as it has the ability to “shows that there is no particular similarity between amino acids in the same row” (Higgs and Attwood 2005, pg 4).
Now, we will create a dataset to drop any categorical variables and compare it to Table 2.2.
aa_dat2 <- aa_dat[,c("volume","bulkiness","polarity","isoelectric.pt","hydrophobe.34","hydrophobe.35","saaH2O","faal.fold")]
We will now run a PCA and create a PCA plot using a base R command and
pca.out <- prcomp(aa_dat2, scale = TRUE)
biplot(pca.out)
### Figure 8: Principal components analysis - vegan We will run the PCA using vegan and call the output rda.out.
rda.out <- vegan::rda(aa_dat2, scale = TRUE)
Scale = TRUE is used to properly and correctly scale numeric data using their respective standard deviations.
Next, we will extract the PCA score and call it rda_scores.
rda_scores <- scores(rda.out)
Now, we will create a biplot using the code below.
biplot(rda.out, display = "sites", col = 0,)
orditorp(rda.out, display = "sites", labels = aa_dat$aa, cex = 1.2)
ordihull(rda.out,
group = aa_dat$charge,
col = 1:7,
lty = 1:7,
lwd = c(3,6))
The plot that is created above varies greatly from that in the Higgs and Atwood paper. For one,the axes and scaling is a bit different than the plot that is shown in the paper. In addition, the name / label of the x and y axes are different, as well. Lastly, the orientation of the figure shown is nearly opposite of the of Figure 2.10.
Cluster analysis is a tool used in statistics that is composed of various algorithms that are ultimately used to classify different objects into similar groups, that assumes a constant rate of evolution. It’s done by comparing the similarity between two objects and if they are maximal they belong in the same group and vice versa. The overall purpose is to place objects into groups or “clusters”.
UPGMA is a type of clustering method that utilizes a bottom-up hierarchical clustering method. it stands for unweighted pair-group method with arithmetic mean. It is a distance method that requires a distance matrix, as a result. UPGMA is a clustering method that is used to generate dendrograms, among other types of visual plots.
aa_dat2 = aa_dat[,-c(1,2,11,12,13,14,15,16,17)]
row.names(aa_dat2) = paste(aa_dat$aa,sep = ".")
dist_euc = dist(aa_dat2,method = "euclidean")
clust_euc = hclust(dist_euc,method = "average") #sets it to use UPGMA
dendro_euc = as.dendrogram(clust_euc)
plot(dendro_euc,horiz = T,
nodePar = list(pch = c(1,NA),
cex = 0.5,
lab.cex = 1.2))
A matrix of distance is a table that is created that can show the distance between a pair of objects. It is a square matrix that takes into account distances, calculations in pairs and the property elements of a set. A matrix of Euclidean distances is a matrix of square distances between two points. It is slightly different from a normal distance matrix in the sense that as opposed to a regular distance matrix, the distances between two objects will be squared. This analysis of the cluster shows some variations between the one in the paper on Higgs and Atwood. Some of the amino acids, for example, are shifted farther / closer than those in the paper. As a consequence, this shows a distinction in the relative proximity of different amino acids.