This script forms a analysis of the 20 Amino Acids and their various data sets and properties. It takes those properties and categorizes the 20 Amino Acids based on similarities and differences, because similar amino acids tend to have the same functions so we can cluster them together. This script uses PCA, UPGMA, Scatterplot matrixes and other R functions.It is a relication of the work done by Higgs and Atwood “Amino Acid Chemistry” paper.
ggplot2 creates elegant and better visualized plots than the default R plots, ggpubr allows us to add additional modifications and formatting to our ggplot2 plots.
library(ggpubr) #is a package that creates "wrappers" for much of ggplot2's core functionality
## Warning: package 'ggpubr' was built under R version 4.0.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.0.3
library(ggplot2) #Creates elegant data visualizations and plots using the grammar of graphics
library(permute) #Functions for generating restricted permuations of data
## Warning: package 'permute' was built under R version 4.0.3
library(lattice) #a powerful and elegant high-level data visualization system
library(vegan) #package used for multivariable analysis
## Warning: package 'vegan' was built under R version 4.0.3
## This is vegan 2.5-6
library(scatterplot3d) #used to make 3D plots
## Warning: package 'scatterplot3d' was built under R version 4.0.3
this data represents all the amino acids found in nature. it details each amino acid and their properties, such as bulk, polarity, molecular weight etc.
# 1 letter code
aa <-c('A','C','D','E','F','G','H','I','K','L','M','N',
'P','Q','R','S','T','V','W','Y')
## molecular weight in dalton
MW.da <-c(89,121,133,146,165,75,155,131,146,131,149,132,115,147,174,105,119,117,204,181)
## vol from van der Waals radii
vol <-c(67,86,91,109,135,48,118,124,135,124,124,96,90,
114,148,73,93,105,163,141)
## bulk – a measure of the shape of the side chain
bulk <-c(11.5,13.46,11.68,13.57,19.8,3.4,13.69,21.4,
15.71,21.4,16.25,12.28,17.43,
14.45,14.28,9.47,15.77,21.57,21.67,18.03)
## pol – a measure of the electric field strength around the molecule
pol <-c(0,1.48,49.7,49.9,0.35,0,51.6,0.13,49.5,0.13,
1.43,3.38,1.58,3.53,52,1.67,1.66,0.13,2.1,1.61)
## isoelec point
isoelec <-c(6,5.07,2.77,3.22,5.48,5.97,7.59,6.02,9.74,5.98,
5.74,5.41,6.3,5.65,10.76,5.68,6.16,5.96,5.89,5.66)
## 1st Hydrophobicity scale
H2Opho.34 <-c(1.8,2.5,-3.5,-3.5,2.8,-0.4,-3.2,4.5,-3.9,3.8,1.9,
-3.5,-1.6,-3.5,-4.5,-0.8,-0.7,4.2,-0.9,-1.3)
## 2nd Hydrophobicity scale
H2Opho.35 <-c(1.6,2,-9.2,-8.2,3.7,1,-3,3.1,-8.8,2.8,3.4,-4.8,
-0.2,-4.1,-12.3,0.6,1.2,2.6,1.9,-0.7)
## Surface area accessible to water in an unfolded peptide
saaH2O <-c(113,140,151,183,218,85,194,182,211,180,204,158,
143,189,241,122,146,160,259,229)
## Fraction of accessible area lost when a protein folds
faal.fold <-c(0.74,0.91,0.62,0.62,0.88,0.72,0.78,0.88,0.52,
0.85,0.85,0.63,0.64,0.62,0.64,0.66,0.7,0.86,0.85,0.76)
# Polar requirement
polar.req <-c(7,4.8,13,12.5,5,7.9,8.4,4.9,10.1,4.9,5.3,10,
6.6,8.6,9.1,7.5,6.6,5.6,5.2,5.4)
## relative frequency of occurance
## "The frequencies column shows the mean
### percentage of each amino acid in the protein sequences
### of modern organisms"
freq <-c(7.8,1.1,5.19,6.72,4.39,6.77,2.03,6.95,6.32,
10.15,2.28,4.37,4.26,3.45,5.23,6.46,5.12,7.01,1.09,3.3)
# charges
## un = Un-charged
## neg = negative
## pos = positive
charge<-c('un','un','neg','neg','un','un','pos','un','pos',
'un','un','un','un','un','pos','un','un','un','un','un')
## hydropathy
hydropathy<-c('hydrophobic','hydrophobic','hydrophilic','
hydrophilic','hydrophobic','neutral','neutral',
'hydrophobic','hydrophilic','hydrophobic','hydrophobic',
'hydrophilic','neutral',
'hydrophilic','hydrophilic','neutral','neutral',
'hydrophobic','hydrophobic','neutral')
## vol
vol.cat<-c('verysmall','small','small','medium',
'verylarge','verysmall','medium','large','large',
'large','large','small','small','medium','large',
'verysmall',
'small','medium','verylarge','verylarge')
## pol
pol.cat<-c('nonpolar','nonpolar','polar','polar',
'nonpolar','nonpolar','polar','nonpolar',
'polar','nonpolar','nonpolar','polar','nonpolar','polar',
'polar','polar','polar','nonpolar','nonpolar','polar')
## chemical
chemical<-c('aliphatic','sulfur','acidic','acidic','aromatic',
'aliphatic','basic','aliphatic','basic','aliphatic','sulfur',
'amide','aliphatic','amide','basic','hydroxyl','hydroxyl',
'aliphatic','aromatic','aromatic')
#Building data frame
aa_dat <- data.frame(aa,MW.da,vol,bulk, pol,isoelec, H2Opho.34, H2Opho.35, saaH2O, faal.fold, polar.req, freq, charge, hydropathy, vol.cat, pol.cat, chemical)
cor_ <- round(cor(aa_dat[,-c(1,13:17)]),2)
diag(cor_) <- NA
cor_[upper.tri(cor_)] <- NA
cor_
## MW.da vol bulk pol isoelec H2Opho.34 H2Opho.35 saaH2O faal.fold
## MW.da NA NA NA NA NA NA NA NA NA
## vol 0.93 NA NA NA NA NA NA NA NA
## bulk 0.55 0.73 NA NA NA NA NA NA NA
## pol 0.29 0.24 -0.20 NA NA NA NA NA NA
## isoelec 0.20 0.36 0.08 0.27 NA NA NA NA NA
## H2Opho.34 -0.27 -0.08 0.44 -0.67 -0.20 NA NA NA NA
## H2Opho.35 -0.25 -0.16 0.32 -0.85 -0.26 0.85 NA NA NA
## saaH2O 0.97 0.99 0.64 0.29 0.35 -0.18 -0.23 NA NA
## faal.fold 0.11 0.18 0.49 -0.53 -0.18 0.84 0.79 0.12 NA
## polar.req -0.05 -0.19 -0.53 0.76 -0.11 -0.79 -0.87 -0.11 -0.81
## freq -0.52 -0.30 -0.04 -0.01 0.02 0.26 -0.02 -0.38 -0.18
## polar.req freq
## MW.da NA NA
## vol NA NA
## bulk NA NA
## pol NA NA
## isoelec NA NA
## H2Opho.34 NA NA
## H2Opho.35 NA NA
## saaH2O NA NA
## faal.fold NA NA
## polar.req NA NA
## freq 0.14 NA
The code below explores which columns have the most positive correlation and which have the most negative correlation between them.
SaaH2O and Vol have the most postive correlation. Polar.req and H2Ophobe.35 have the most negative
which(cor_ == max(cor_, na.rm = T), arr.ind = T)
## row col
## saaH2O 8 2
max(cor_ , na.rm = T)
## [1] 0.99
which(cor_ == min(cor_, na.rm = T), arr.ind = T)
## row col
## polar.req 10 7
min(cor_ , na.rm = T)
## [1] -0.87
pairs(~MW.da + vol + pol + H2Opho.34 + polar.req + saaH2O + faal.fold,
data= aa_dat)
This function allows us to view the correlation between our varialbles with 0 being no correlation and 1 being highest correlation. It also makes the high correlation numbers bigger and makes the low correlation numbers smaller for easier visual. It also creates a line of best fit for the graphs
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor,...) #5 arguments
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}
pairs(~MW.da + vol + pol + H2Opho.34 + polar.req + saaH2O + faal.fold,
data= aa_dat,
lower.panel = panel.smooth ,
upper.panel = panel.cor ,
gap = 0
)
We can conclude that there are strong correlations between: MW.da and vol, MW.da and saaH2O, Vol and saaH2O, each of them have a correlation of <90, there are other strong correlations (80-89). some weak correlations are: polar.req and MW.da, andsaaH2O and polar.req, H2Ophobe.34 and vol. Some examples of non-linear relationships are:Polar.req and faal.fold , pol and any other variable (non have linear relationship)
The plot showcases the relationship between the Volume and the Isoelectric point variables. this graph also indicates that acidic amino acids have a low Isoelectric point, while basic Amino Acids have a high Isoelectric point
plot(vol~isoelec, data= aa_dat, #this command plots the actual graph
xlab = "Pl",
ylab = "Volume",
main = "Plot of Amino Acid Volume Against Pl",
col = 0)
text(vol ~ isoelec, #this command is adding text labels as seen by labels = aa, aa is the 1 letter abbreviation of our amino acids
labels = aa,
data = aa_dat,
col = 1:20)
ggscatter(y = "vol",
x = "isoelec",
size = "MW.da",
color = "polar.req",
data = aa_dat,
xlab = "Isoelectric Point",
ylab = "Volume")
creating a graph out of 2 non-linear variables, while also using a smoother to showcases that the relationship is not linear.
ggscatter(data = aa_dat,
y = "vol",
x = "H2Opho.34",
size = "MW.da",
color = "pol",
add = "loess") +
xlab("H2O Phobia")+
ylab("Volume")
## `geom_smooth()` using formula 'y ~ x'
The log functions can transform non-linear relationships to linear one by putting them in a logarithmic scale, we are able to examine correlation better with this scale
aa_dat$log_vol<-log(vol)
aa_dat$log_H2Opho.34<-log(H2Opho.34)
## Warning in log(H2Opho.34): NaNs produced
ggscatter(data = aa_dat,
y = "log_vol",
x = "log_H2Opho.34",
size = "bulk",
color = "freq",
add = "reg.line") +
xlab("log of H2O phobia") +
ylab("log of Volume")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 13 rows containing non-finite values (stat_smooth).
## Warning: Removed 13 rows containing missing values (geom_point).
The 3D graph does not seem to show correlation between the 3 different variables since they all lie in different areas of the graph. a strong correlation would be if the majority of their datas are grouped in the same spot
scatterplot3d(x = aa_dat$saaH2O,
y = aa_dat$MW.da,
z = aa_dat$svol,
highlight.3d = TRUE,
angle = 100,
type="h")
Principal Component Analysis, shortened to PCA, is a analysis method used commonly by computational biology researchers. It’s basic idea is plotting multileveled data (such as 8 variables) and scaling them down to a 2D graph which showcases related variables clustered together. Higgs Atwood, who is a renowned computational biology professor, wrote in his book “The PCA method transforms this cloud of points first by scaling them and shifting them to the origin, and then by rotating them in such a way that the points are spread out as much as possible, and the structure in the data is made easier to see.” We have two ways of utilizing PCA in R. Either through the base PCA function in R, or through the vegan package which contains the function rda().
In our Dataset, we have 11 variables that contain properties of amino acids, including non numeric ones. our first step is scaling back this large dataset to 8 variables which are mentioned in Higgs and Attwood data table. coincidentally only numeric data is in the table
aa_dat2 = data.frame(vol, bulk, pol, isoelec,H2Opho.34, H2Opho.35, saaH2O, faal.fold)
Run a PCA and create a PCA plot using base R command.
pca.out <- prcomp(aa_dat2[,-c(7,8)], scale = TRUE)
biplot(pca.out)
Run the PCA analysis through vegan
rda.out <- vegan::rda(aa_dat2, scale = TRUE) #eigenvalues to display the importance of axes and to describe the true configuration of points. that means they change by a scaler value
Extract the known PCA score
rda_scores <- scores(rda.out)
Make a “biplot” (I usually call this a PCA plot
biplot(rda.out, display = "sites", col = 0,)
orditorp(rda.out, display = "sites", labels = aa_dat$aa, cex = 1.2)
ordihull(rda.out,
group = aa_dat$hydropathy,
col = 1:7,
lty = 1:7,
lwd = c(3,6))
The figure in Higgs Attwood (figure 2.10) is different than our figure for example our component numbers are different, and using vegan, we highlighted the most likely grouping while in Higgs plot no highlighting is made.
Cluster analysis allows to create meaningful subgroups of large data sets, these groups are highly correlated together and have similar properties.
The UPGMA algorithm constructs a rooted tree that reflects the structure present in a pairwise dismimilarty matrix. At each step, the nearest two clusters are combined into a higher-level cluster. The distance between any two clusters is taken to the mean distance of all elements in each cluster that is made
Adding the row names
row.names(aa_dat2) <- paste(aa_dat$aa, sep = "")
row.names(aa_dat2)
## [1] "A" "C" "D" "E" "F" "G" "H" "I" "K" "L" "M" "N" "P" "Q" "R" "S" "T" "V" "W"
## [20] "Y"
A distance matrix creates a two dimensional mattrix, in which distance represents the similarities or differences between each point, with similarities being a high number on a 0 to 1 scale and differences being vice-versa
Euclidean is a geometric analysis which seeks to understand the geometry of flat, two-dimensional spaces in which any 2 points can be joined by a straight line. it follows Euclid postulates
dist_euc <- dist(aa_dat2[,-9],
method = "euclidean")
A cluster analysis can be done with hclust() function with the UPGMA algorithm.
clust_euc <- hclust(dist_euc, method = "average") #the method "average" performs UPGMA cluster analysis
Plot the dendrogram
par(mar = c(1,1,1,1))
dendro_euc <- as.dendrogram(clust_euc)
plot(dendro_euc,horiz = T,nodePar = list(pch = c(1,NA),
cex = 0.7,
lab.cex = 1))