In this script, we are showing that there are several similarities and differences between amino acids and their properties. We build a data frame then create similarity matrices, scatterplots, and PCAs which allow for a further analysis of the similarities between amino acids. This is important biologically because it allows us to see how amino acids differ from one another and which amino acids are more genetically similar compared to others. Further this will help us in biology because then we can determine that if amino acids change in specific scenarios, then we can better predict what those changes may effect.
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.5-6
These data are derived mostly from Higgs (2009) Table 1. They are similar to Higgs and Attwood 2005 but have extra columns. Nathan Brouwer added additional categorical varibles based on online information.
This data represents the amino acids and their properties.
Make the vectors.
# 1 letter code
aa <-c('A','C','D','E','F','G','H','I','K','L','M','N',
'P','Q','R','S','T','V','W','Y')
## molecular weight in dalton
MW.da <-c(89,121,133,146,165,75,155,131,146,131,149,132,115,147,174,105,119,117,204,181)
## vol from van der Waals radii
vol <-c(67,86,91,109,135,48,118,124,135,124,124,96,90,
114,148,73,93,105,163,141)
## bulk – a measure of the shape of the side chain
bulk <-c(11.5,13.46,11.68,13.57,19.8,3.4,13.69,21.4,
15.71,21.4,16.25,12.28,17.43,
14.45,14.28,9.47,15.77,21.57,21.67,18.03)
## pol – a measure of the electric field strength around the molecule
pol <-c(0,1.48,49.7,49.9,0.35,0,51.6,0.13,49.5,0.13,
1.43,3.38,1.58,3.53,52,1.67,1.66,0.13,2.1,1.61)
## isoelec point
isoelec <-c(6,5.07,2.77,3.22,5.48,5.97,7.59,6.02,9.74,5.98,
5.74,5.41,6.3,5.65,10.76,5.68,6.16,5.96,5.89,5.66)
## 1st Hydrophobicity scale
H2Opho.34 <-c(1.8,2.5,-3.5,-3.5,2.8,-0.4,-3.2,4.5,-3.9,3.8,1.9,
-3.5,-1.6,-3.5,-4.5,-0.8,-0.7,4.2,-0.9,-1.3)
## 2nd Hydrophobicity scale
H2Opho.35 <-c(1.6,2,-9.2,-8.2,3.7,1,-3,3.1,-8.8,2.8,3.4,-4.8,
-0.2,-4.1,-12.3,0.6,1.2,2.6,1.9,-0.7)
## Surface area accessible to water in an unfolded peptide
saaH2O <-c(113,140,151,183,218,85,194,182,211,180,204,158,
143,189,241,122,146,160,259,229)
## Fraction of accessible area lost when a protein folds
faal.fold <-c(0.74,0.91,0.62,0.62,0.88,0.72,0.78,0.88,0.52,
0.85,0.85,0.63,0.64,0.62,0.64,0.66,0.7,0.86,0.85,0.76)
# Polar requirement
polar.req <-c(7,4.8,13,12.5,5,7.9,8.4,4.9,10.1,4.9,5.3,10,
6.6,8.6,9.1,7.5,6.6,5.6,5.2,5.4)
## relative frequency of occurance
## "The frequencies column shows the mean
### percentage of each amino acid in the protein sequences
### of modern organisms"
freq <-c(7.8,1.1,5.19,6.72,4.39,6.77,2.03,6.95,6.32,
10.15,2.28,4.37,4.26,3.45,5.23,6.46,5.12,7.01,1.09,3.3)
##
# charges
## un = Un-charged
## neg = negative
## pos = positive
charge<-c('un','un','neg','neg','un','un','pos','un','pos',
'un','un','un','un','un','pos','un','un','un','un','un')
## hydropathy
hydropathy<-c('hydrophobic','hydrophobic','hydrophilic','
hydrophilic','hydrophobic','neutral','neutral',
'hydrophobic','hydrophilic','hydrophobic','hydrophobic',
'hydrophilic','neutral',
'hydrophilic','hydrophilic','neutral','neutral',
'hydrophobic','hydrophobic','neutral')
## vol
vol.cat<-c('verysmall','small','small','medium',
'verylarge','verysmall','medium','large','large',
'large','large','small','small','medium','large',
'verysmall',
'small','medium','verylarge','verylarge')
## pol
pol.cat<-c('nonpolar','nonpolar','polar','polar',
'nonpolar','nonpolar','polar','nonpolar',
'polar','nonpolar','nonpolar','polar','nonpolar','polar',
'polar','polar','polar','nonpolar','nonpolar','polar')
## chemical
chemical<-c('aliphatic','sulfur','acidic','acidic','aromatic',
'aliphatic','basic','aliphatic','basic','aliphatic','sulfur',
'amide','aliphatic','amide','basic','hydroxyl','hydroxyl',
'aliphatic','aromatic','aromatic')
Build the dataframes.
### Build dataframe
aa_dat <- data.frame(aa,
MW.da,vol,
bulk, pol,
isoelec,H2Opho.34, H2Opho.35,
saaH2O, faal.fold, polar.req,
freq, charge, hydropathy,
vol.cat, pol.cat, chemical)
Finds the Correlation - similarity and dissimilarity - between the different variables in the data frame.
cor_ <- round(cor(aa_dat[,-c(1,13:17)]),2)
diag(cor_) <- NA
cor_[upper.tri(cor_)] <- NA
cor_
## MW.da vol bulk pol isoelec H2Opho.34 H2Opho.35 saaH2O faal.fold
## MW.da NA NA NA NA NA NA NA NA NA
## vol 0.93 NA NA NA NA NA NA NA NA
## bulk 0.55 0.73 NA NA NA NA NA NA NA
## pol 0.29 0.24 -0.20 NA NA NA NA NA NA
## isoelec 0.20 0.36 0.08 0.27 NA NA NA NA NA
## H2Opho.34 -0.27 -0.08 0.44 -0.67 -0.20 NA NA NA NA
## H2Opho.35 -0.25 -0.16 0.32 -0.85 -0.26 0.85 NA NA NA
## saaH2O 0.97 0.99 0.64 0.29 0.35 -0.18 -0.23 NA NA
## faal.fold 0.11 0.18 0.49 -0.53 -0.18 0.84 0.79 0.12 NA
## polar.req -0.05 -0.19 -0.53 0.76 -0.11 -0.79 -0.87 -0.11 -0.81
## freq -0.52 -0.30 -0.04 -0.01 0.02 0.26 -0.02 -0.38 -0.18
## polar.req freq
## MW.da NA NA
## vol NA NA
## bulk NA NA
## pol NA NA
## isoelec NA NA
## H2Opho.34 NA NA
## H2Opho.35 NA NA
## saaH2O NA NA
## faal.fold NA NA
## polar.req NA NA
## freq 0.14 NA
Which variables have the most postive correlation? What have most negative correlation?
The code below is determining which of the variables in the similarity matrix have the most positive correlation and the rows in which it is. The most positive correlation is the number which is closest to 1 (0.99) and the most negative correlation is the number which is closest to -1 (-0.87). The variables which have the most positive correlation is saaH20 and MW.da and the most negative correlation is polar.req and H2Opho.35.
which(cor_ == max(cor_, na.rm = T), arr.ind = T)
## row col
## saaH2O 8 2
which(cor_ == min(cor_, na.rm = T), arr.ind = T)
## row col
## polar.req 10 7
print(max(cor_, na.rm = T))
## [1] 0.99
print(min(cor_, na.rm = T))
## [1] -0.87
The code below takes the six variables “MW.da”,“vol”,“pol”,“H2Opho.34”,“polar.req”, “saaH2O”, “faal.fold” and put them into a scatter plot matrix showing the scatterplot on the bottom left triangle and the upper right triangle has their correlations.
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor,...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}
In the created scatter plot matrix the saaH20 and the vol have the greatest, most positive correlation. Polar.req and MW.da have the lowest correlation of 0.05. There are several that do not have linear relationships such as MW.da - pol and polar.req - MW.da.
plot(aa_dat[, c("MW.da","vol","pol","H2Opho.34","polar.req", "saaH2O", "faal.fold")],
panel = panel.smooth,
upper.panel =panel.cor ,
main = "ScatterPlot Matrix")
This plot looks at the amino acid properties of the polar requirement and the frequency of occurrences. The letters on the graph are the indications for which amino acid it is talking about.
This is taking the same variables used above but making it into a scatter plot with ggpubr. With using color and size to indicate two additional variables including the ones being used.
ggscatter(y = "polar.req",
x = "freq",
size = "saaH2O",
color = "bulk",
data = aa_dat,
xlab = "freq",
ylab = "polar.req")
In this graph, the two variables used are polar.req and vol because of their non-liner relationship. I used a loess smoother which is function of ggpubr which allows for a better prediction of the line of best fit. I also used the variables freq and saaH2O for the color and size analysis to add 2 other numeric variables.
ggscatter(y = "polar.req",
x = "vol",
data = aa_dat,
size = "freq",
color = "saaH2O",
add = "loess", #adds smoother through ggpubr
xlab = "vol",
ylab = "polar.req")
## `geom_smooth()` using formula 'y ~ x'
This graph uses the same variables as the previous graph but changes them to use a log-based scale rather than their regular scale. Using the log based scale allows for the graph to use smaller values that are closer together rather than using bigger values which make the graph harder to interpret.
aa_dat$polar.req_log <- log(aa_dat$polar.req) #creating a new column with polar.req_log
aa_dat$vol_log <- log(aa_dat$vol) #creating a new column with vol_log
ggscatter(y = "polar.req_log",
x = "vol_log",
data = aa_dat,
size = "freq",
color = "freq",
add = "loess", #adds smoother through ggpubr
xlab = "vol_log",
ylab = "polar.req_log",
)
## `geom_smooth()` using formula 'y ~ x'
The three variables I chose for this 3D scatterplot are MW.da, vol, an saaH2O. I chose these variables because they have strong correlations with each other. The correlation between MW.da and vol is 0.93, MW.da and saaH2O is 0.97, and vol and saaH2O is 0.99.
scatterplot3d(x = aa_dat$MW.da,
y = aa_dat$vol,
z = aa_dat$saaH2O,
highlight.3d = TRUE,
type = "h")
Subsetting Dataframe manually by using the names of the column.
aa_dat2 <- aa_dat[,c("vol", "bulk", "pol", "isoelec", "H2Opho.34", "H2Opho.35",
"saaH2O", "faal.fold")]
“The PCA method transforms this cloud of points first by scaling them and shifting them to the origin, and then by rotating them in such a way that the points are spread out as much as possible, and the structure in the data is made easier to see” (Higgs and Attwood 2005, pg 26). In this class we use PCA as a dimension reduction method. The two basic ways we create a PCA plot in this class is using base R and by using vegan. The PCA using vegan is sometimes easier to comprehend because it is easier to look at graphically and is not as “clunky” as PCA using R.
Running of the PCA and creating a PCA plot using base R.
pca.out <- prcomp(aa_dat2, scale = TRUE)
Plotting the output of the PCA.
biplot(pca.out)
Running a PCA using vegan instead of base R.
rda.out <- vegan::rda(aa_dat2,
scale = TRUE) # scale makes the scale more concise uses smaller intervals. If you were to put scale = False the scale would be from -10 - 10 rather than -3 - 3 from the x axis.
rda_scores <- scores(rda.out) #extract PCA scores
biplot(rda.out, display = "sites", col = 0,)
orditorp(rda.out, display = "sites", labels = aa_dat$aa, cex = 1.2)
ordihull(rda.out,
group = aa_dat$hydropathy,
col = 1:7,
lty = 1:7,
lwd = c(3,6))
The above plot is the plot from Higgs and Attwood using hydropathy and the below plot uses pol.cat. The main differences between the two plots are that the plot from Higgs uses 3 categories [hydrophobic, hydrophilic, and neutral] and the variable I used only have 2 categories [polar and nonpolar]; therefore, in the plots the Higgs plot has 3 circles and my plot have only 2 circles correlating to the number of categories.
biplot(rda.out, display = "sites", col = 0,)
orditorp(rda.out, display = "sites", labels = aa_dat$aa, cex = 1.2)
ordihull(rda.out,
group = aa_dat$pol.cat,
col = 1:7,
lty = 1:7,
lwd = c(3,6))
The purpose of a cluster analysis is to analyze the similarities between two amino acids. It will show which amino acids are closely related compared to ones that are further apart allowing to “cluster” and “group” the variables.
The distance matrix shows the pairwise distances between two structures. It shows which variables are most similar and which are not. The euclidean distance is used because it determines the distance between two “cells.” It allows for a universal way of calculating the distances between two objects in the distance matrix.
The cluster dendrogram created here is a little different from the one created in the the Higgs paper because the amino acids are rearranged a little differently and in the Higgs articles, they have a better image on what their dendrogram is meaning. The dendrogram created here is similar to the one created by Brouwer in that they have the same rearrangement of amino acids, but mine is not oriented the same as it.
row.names(aa_dat2) <- paste(aa_dat2$Species,1:nrow(aa_dat2),sep = ".")
dist_euc <- dist(aa_dat2, method = "euclidean")
clust_euc <- hclust(dist_euc, method = "average") # method = "average" allows for the UPGMA
plot(clust_euc, labels = aa_dat$aa)