This particular script is focused on replicating Higgs and Attwood’s analysis on the properties of 20 amino acids. Clustering is used to group different amino acids based on their similarities and differences. Some of the clustering methods used include PCA and UPGMA. This script helps us visualize the similarities and differences between amino acids, allowing scientists to learn more about shared properties, which can later be applied to futher studies of amino acids in several areas such as computational drug discovery.
## Warning: package 'ggplot2' was built under R version 4.0.3
## Warning: package 'ggpubr' was built under R version 4.0.3
## Warning: package 'vegan' was built under R version 4.0.3
## Loading required package: permute
## Warning: package 'permute' was built under R version 4.0.3
## Loading required package: lattice
## This is vegan 2.5-6
## Warning: package 'scatterplot3d' was built under R version 4.0.3
These data are derived mostly from Higgs (2009) Table 1. They are similar to Higgs and Attwood 2005 but have extra columns. Nathan Brouwer added additional categorical varibles based on online information.
Make the vectors.
# 1 letter code
aa <-c('A','C','D','E','F','G','H','I','K','L','M','N',
'P','Q','R','S','T','V','W','Y')
## molecular weight in dalton
MW.da <-c(89,121,133,146,165,75,155,131,146,131,149,132,115,147,174,105,119,117,204,181)
## vol from van der Waals radii
vol <-c(67,86,91,109,135,48,118,124,135,124,124,96,90,
114,148,73,93,105,163,141)
## bulk – a measure of the shape of the side chain
bulk <-c(11.5,13.46,11.68,13.57,19.8,3.4,13.69,21.4,
15.71,21.4,16.25,12.28,17.43,
14.45,14.28,9.47,15.77,21.57,21.67,18.03)
## pol – a measure of the electric field strength around the molecule
pol <-c(0,1.48,49.7,49.9,0.35,0,51.6,0.13,49.5,0.13,
1.43,3.38,1.58,3.53,52,1.67,1.66,0.13,2.1,1.61)
## isoelec point
isoelec <-c(6,5.07,2.77,3.22,5.48,5.97,7.59,6.02,9.74,5.98,
5.74,5.41,6.3,5.65,10.76,5.68,6.16,5.96,5.89,5.66)
## 1st Hydrophobicity scale
H2Opho.34 <-c(1.8,2.5,-3.5,-3.5,2.8,-0.4,-3.2,4.5,-3.9,3.8,1.9,
-3.5,-1.6,-3.5,-4.5,-0.8,-0.7,4.2,-0.9,-1.3)
## 2nd Hydrophobicity scale
H2Opho.35 <-c(1.6,2,-9.2,-8.2,3.7,1,-3,3.1,-8.8,2.8,3.4,-4.8,
-0.2,-4.1,-12.3,0.6,1.2,2.6,1.9,-0.7)
## Surface area accessible to water in an unfolded peptide
saaH2O <-c(113,140,151,183,218,85,194,182,211,180,204,158,
143,189,241,122,146,160,259,229)
## Fraction of accessible area lost when a protein folds
faal.fold <-c(0.74,0.91,0.62,0.62,0.88,0.72,0.78,0.88,0.52,
0.85,0.85,0.63,0.64,0.62,0.64,0.66,0.7,0.86,0.85,0.76)
# Polar requirement
polar.req <-c(7,4.8,13,12.5,5,7.9,8.4,4.9,10.1,4.9,5.3,10,
6.6,8.6,9.1,7.5,6.6,5.6,5.2,5.4)
## relative frequency of occurance
## "The frequencies column shows the mean
### percentage of each amino acid in the protein sequences
### of modern organisms"
freq <-c(7.8,1.1,5.19,6.72,4.39,6.77,2.03,6.95,6.32,
10.15,2.28,4.37,4.26,3.45,5.23,6.46,5.12,7.01,1.09,3.3)
##
# charges
## un = Un-charged
## neg = negative
## pos = positive
charge<-c('un','un','neg','neg','un','un','pos','un','pos',
'un','un','un','un','un','pos','un','un','un','un','un')
## hydropathy
hydropathy<-c('hydrophobic','hydrophobic','hydrophilic','
hydrophilic','hydrophobic','neutral','neutral',
'hydrophobic','hydrophilic','hydrophobic','hydrophobic',
'hydrophilic','neutral',
'hydrophilic','hydrophilic','neutral','neutral',
'hydrophobic','hydrophobic','neutral')
## vol
vol.cat<-c('verysmall','small','small','medium',
'verylarge','verysmall','medium','large','large',
'large','large','small','small','medium','large',
'verysmall',
'small','medium','verylarge','verylarge')
## pol
pol.cat<-c('nonpolar','nonpolar','polar','polar',
'nonpolar','nonpolar','polar','nonpolar',
'polar','nonpolar','nonpolar','polar','nonpolar','polar',
'polar','polar','polar','nonpolar','nonpolar','polar')
## chemical
chemical<-c('aliphatic','sulfur','acidic','acidic','aromatic',
'aliphatic','basic','aliphatic','basic','aliphatic','sulfur',
'amide','aliphatic','amide','basic','hydroxyl','hydroxyl',
'aliphatic','aromatic','aromatic')
Build the dataframes. This code generates a data frame withe all the information that can be analyzed later using other functions.
### Build dataframe
aa_dat <- data.frame(aa,
MW.da,vol,
bulk, pol,
isoelec,H2Opho.34, H2Opho.35,
saaH2O, faal.fold, polar.req,
freq, charge, hydropathy,
vol.cat, pol.cat, chemical)
This code generates a correlation matrix.
cor_ <- round(cor(aa_dat[,-c(1,13:17)]),2)
diag(cor_) <- NA
cor_[upper.tri(cor_)] <- NA
cor_
## MW.da vol bulk pol isoelec H2Opho.34 H2Opho.35 saaH2O faal.fold
## MW.da NA NA NA NA NA NA NA NA NA
## vol 0.93 NA NA NA NA NA NA NA NA
## bulk 0.55 0.73 NA NA NA NA NA NA NA
## pol 0.29 0.24 -0.20 NA NA NA NA NA NA
## isoelec 0.20 0.36 0.08 0.27 NA NA NA NA NA
## H2Opho.34 -0.27 -0.08 0.44 -0.67 -0.20 NA NA NA NA
## H2Opho.35 -0.25 -0.16 0.32 -0.85 -0.26 0.85 NA NA NA
## saaH2O 0.97 0.99 0.64 0.29 0.35 -0.18 -0.23 NA NA
## faal.fold 0.11 0.18 0.49 -0.53 -0.18 0.84 0.79 0.12 NA
## polar.req -0.05 -0.19 -0.53 0.76 -0.11 -0.79 -0.87 -0.11 -0.81
## freq -0.52 -0.30 -0.04 -0.01 0.02 0.26 -0.02 -0.38 -0.18
## polar.req freq
## MW.da NA NA
## vol NA NA
## bulk NA NA
## pol NA NA
## isoelec NA NA
## H2Opho.34 NA NA
## H2Opho.35 NA NA
## saaH2O NA NA
## faal.fold NA NA
## polar.req NA NA
## freq 0.14 NA
This code outputs the column/row with the most positive and negative correlation. SaaH2O and vol have the most positive correlation of 0.99. Polar.req and H2Opho.35 have the most negative correlation of -0.87.
which(cor_ == max(cor_, na.rm = T), arr.ind = T)
## row col
## saaH2O 8 2
which(cor_ == min(cor_, na.rm = T), arr.ind = T)
## row col
## polar.req 10 7
This function allows for an analysis of correlation between variables x and y (positive, negative, or no correlation).
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor,...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}
This code generates a scatterplot matrix.
plot( ~ MW.da + vol + pol + H2Opho.34 + polar.req + saaH2O + faal.fold,
data = na.omit(aa_dat[,c("MW.da","vol","pol","H2Opho.34","polar.req","saaH2O","faal.fold")]),
panel = panel.smooth,
upper.panel =panel.cor )
This generates a graph showing the relationship between Amino Acid Volume and pI.
#sets up blank plot and labels
plot(polar.req ~ freq, data = aa_dat,
xlab = "pI",
ylab = "Volume",
main = "Amino Acid Volume vs. pI ",
col = 0)
#tilda uses plot() to plot the data points
text(polar.req ~ freq,
labels = aa,
data = aa_dat,
col = 1:20)
ggscatter(y = "vol",
x = "isoelec",
size = "MW.da",
color = "bulk",
data = aa_dat,
xlab = "Isoelectric Point",
ylab = "Volume")
ggscatter(y = "saaH2O",
x = "polar.req",
color = "bulk",
size = "MW.da",
add = "loess",
data = aa_dat,
xlab = "Polar requirement",
ylab = "Surface area accessible to water in an unfolded peptide")
## `geom_smooth()` using formula 'y ~ x'
Log allows us to look at non-linear data from a linear perspective by changing the scale.
aa_dat$saaH2O_log <- log(aa_dat[,"saaH2O"])
aa_dat$polar.req_log <- log(aa_dat[,"polar.req"])
ggscatter(y = "saaH2O",
x = "polar.req_log",
add = "reg.line",
size = "MW.da",
color = "bulk",
xlab = "Log10 polar requirement",
ylab = "Log10 saaH2O",
data = aa_dat)
## `geom_smooth()` using formula 'y ~ x'
This code generates a 3D scatterplot, showing the relationship between frequency, polar requirement, and isoelectric point.
scatterplot3d(x = aa_dat$polar.req,
y = aa_dat$isoelec,
z = aa_dat$freq,
highlight.3d = TRUE,
type = "h")
Principal Component Analysis (PCA) is a method of cluster data together based on their similarity to reduce the dimensions of large data sets by scaling them down. The authors describe it as a way to visualize multidimensional data, focusing on its important properties (Higgs and Attwood 2005 pg 34).
PCA plots were made using R’s packages and manually through distance matrices.
This code creates a modified version of the original data frame without the categorical variables.
aa_dat2 = data.frame(vol, bulk, pol, isoelec, H2Opho.34, H2Opho.35, saaH2O, faal.fold)
Makes PCA plot using base R.
pca.out <- prcomp(aa_dat2, scale = TRUE)
biplot(pca.out)
Runs PCA with vegan. “Scale = TRUE” sets the scale based on the numeric data
rda.out <- vegan::rda(aa_dat2, scale = TRUE)
Extracts PCA scores as rda_scores
rda_scores <- scores(rda.out)
Creates biplot(PCA plot)
biplot(rda.out, display = "sites", col = 0,)
orditorp(rda.out, display = "sites", labels = aa_dat$aa, cex = 1.2)
ordihull(rda.out,
group = aa_dat$charge,
col = 1:7,
lty = 1:7,
lwd = c(3,6))
Some of the biggest differences between this plot and Higgs and Attwood’s Figure 2.10 is the orientation of the plot. In addition, the scaling of both x and y axes is different.
Cluster analysis is used to group objects based on their similarity by analyzing them through a property. Object that are closer together form “clusters”, allowing us to deduce that they are similar to each other based on distance.
UPGMA is a type of method used for cluster analysis. Standing for unweighted pair group method with arithmetic mean, it find the lowest mean distance between two objects and groups them together. It repeats the process until all objects grouped together, which can be used to plot a tree based on distances.
aa_dat2 = aa_dat[,-c(1,2,11,12,13,14,15,16,17)]
row.names(aa_dat2) = paste(aa_dat$aa,sep = ".")
dist_euc = dist(aa_dat2,method = "euclidean")
clust_euc = hclust(dist_euc,method = "average") #average implies UPGMA
dendro_euc = as.dendrogram(clust_euc)
plot(dendro_euc,horiz = T,
nodePar = list(pch = c(1,NA),
cex = 0.5,
lab.cex = 1.2))
Distance Matrices show distance between objects based on their similarities. Higher numbers in the table imply that the objects aren’t very similar to each other while zero implies that the objects are identical.
Euclidean distance refers to difference between a two points on a 2D line. Trees are plotted with lines and the distance between two objects can be identified by the euclidean distances.
While this dendogram and Higgs and Attwood’s dendogram is quite similar, there are some differences between the two. Some of the clades are arranged differently due to differences in amino acid properties.