The script below attempts to recreate the research of Higgs and Attwood about amino acids. Various scatterplots and dendrograms are created in order to understand different relationships between the variables. This is important to study because understanding amino acids, gives us insight to how proteins and other key biological functions take place. Additionally, learning about the relationship between different amino acids can help further our understanding of diseases, mutations and other alignments to the human body.
ggpubr makes creating graphs and plots with ggplot2 easier. The package contains functions that allow the programmer to easily customize plots created by using ggplot2.
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.5-6
These data are derived mostly from Higgs (2009) Table 1. They are similar to Higgs and Attwood 2005 but have extra columns. Nathan Brouwer added additional categorical varibles based on online information.
The data represents all necessary information about all the 20 amino acids.
Make the vectors.
# 1 letter code
aa <-c('A','C','D','E','F','G','H','I','K','L','M','N',
'P','Q','R','S','T','V','W','Y')
## molecular weight in dalton
MW.da <-c(89,121,133,146,165,75,155,131,146,131,149,132,115,147,174,105,119,117,204,181)
## vol from van der Waals radii
vol <-c(67,86,91,109,135,48,118,124,135,124,124,96,90,
114,148,73,93,105,163,141)
## bulk – a measure of the shape of the side chain
bulk <-c(11.5,13.46,11.68,13.57,19.8,3.4,13.69,21.4,
15.71,21.4,16.25,12.28,17.43,
14.45,14.28,9.47,15.77,21.57,21.67,18.03)
## pol – a measure of the electric field strength around the molecule
pol <-c(0,1.48,49.7,49.9,0.35,0,51.6,0.13,49.5,0.13,
1.43,3.38,1.58,3.53,52,1.67,1.66,0.13,2.1,1.61)
## isoelec point
isoelec <-c(6,5.07,2.77,3.22,5.48,5.97,7.59,6.02,9.74,5.98,
5.74,5.41,6.3,5.65,10.76,5.68,6.16,5.96,5.89,5.66)
## 1st Hydrophobicity scale
H2Opho.34 <-c(1.8,2.5,-3.5,-3.5,2.8,-0.4,-3.2,4.5,-3.9,3.8,1.9,
-3.5,-1.6,-3.5,-4.5,-0.8,-0.7,4.2,-0.9,-1.3)
## 2nd Hydrophobicity scale
H2Opho.35 <-c(1.6,2,-9.2,-8.2,3.7,1,-3,3.1,-8.8,2.8,3.4,-4.8,
-0.2,-4.1,-12.3,0.6,1.2,2.6,1.9,-0.7)
## Surface area accessible to water in an unfolded peptide
saaH2O <-c(113,140,151,183,218,85,194,182,211,180,204,158,
143,189,241,122,146,160,259,229)
## Fraction of accessible area lost when a protein folds
faal.fold <-c(0.74,0.91,0.62,0.62,0.88,0.72,0.78,0.88,0.52,
0.85,0.85,0.63,0.64,0.62,0.64,0.66,0.7,0.86,0.85,0.76)
# Polar requirement
polar.req <-c(7,4.8,13,12.5,5,7.9,8.4,4.9,10.1,4.9,5.3,10,
6.6,8.6,9.1,7.5,6.6,5.6,5.2,5.4)
## relative frequency of occurance
## "The frequencies column shows the mean
### percentage of each amino acid in the protein sequences
### of modern organisms"
freq <-c(7.8,1.1,5.19,6.72,4.39,6.77,2.03,6.95,6.32,
10.15,2.28,4.37,4.26,3.45,5.23,6.46,5.12,7.01,1.09,3.3)
##
# charges
## un = Un-charged
## neg = negative
## pos = positive
charge<-c('un','un','neg','neg','un','un','pos','un','pos',
'un','un','un','un','un','pos','un','un','un','un','un')
## hydropathy
hydropathy<-c('hydrophobic','hydrophobic','hydrophilic','
hydrophilic','hydrophobic','neutral','neutral',
'hydrophobic','hydrophilic','hydrophobic','hydrophobic',
'hydrophilic','neutral',
'hydrophilic','hydrophilic','neutral','neutral',
'hydrophobic','hydrophobic','neutral')
## vol
vol.cat<-c('verysmall','small','small','medium',
'verylarge','verysmall','medium','large','large',
'large','large','small','small','medium','large',
'verysmall',
'small','medium','verylarge','verylarge')
## pol
pol.cat<-c('nonpolar','nonpolar','polar','polar',
'nonpolar','nonpolar','polar','nonpolar',
'polar','nonpolar','nonpolar','polar','nonpolar','polar',
'polar','polar','polar','nonpolar','nonpolar','polar')
## chemical
chemical<-c('aliphatic','sulfur','acidic','acidic','aromatic',
'aliphatic','basic','aliphatic','basic','aliphatic','sulfur',
'amide','aliphatic','amide','basic','hydroxyl','hydroxyl',
'aliphatic','aromatic','aromatic')
Build the dataframes.
### Build dataframe
aa_dat <- data.frame(aa,
MW.da,vol,
bulk, pol,
isoelec,H2Opho.34, H2Opho.35,
saaH2O, faal.fold, polar.req,
freq, charge, hydropathy,
vol.cat, pol.cat, chemical)
TODO: Examine the code below. What is it called? Replace the XXXXXXXX above with the name of this thing
cor_ <- round(cor(aa_dat[,-c(1,13:17)]),2)
diag(cor_) <- NA
cor_[upper.tri(cor_)] <- NA
cor_
## MW.da vol bulk pol isoelec H2Opho.34 H2Opho.35 saaH2O faal.fold
## MW.da NA NA NA NA NA NA NA NA NA
## vol 0.93 NA NA NA NA NA NA NA NA
## bulk 0.55 0.73 NA NA NA NA NA NA NA
## pol 0.29 0.24 -0.20 NA NA NA NA NA NA
## isoelec 0.20 0.36 0.08 0.27 NA NA NA NA NA
## H2Opho.34 -0.27 -0.08 0.44 -0.67 -0.20 NA NA NA NA
## H2Opho.35 -0.25 -0.16 0.32 -0.85 -0.26 0.85 NA NA NA
## saaH2O 0.97 0.99 0.64 0.29 0.35 -0.18 -0.23 NA NA
## faal.fold 0.11 0.18 0.49 -0.53 -0.18 0.84 0.79 0.12 NA
## polar.req -0.05 -0.19 -0.53 0.76 -0.11 -0.79 -0.87 -0.11 -0.81
## freq -0.52 -0.30 -0.04 -0.01 0.02 0.26 -0.02 -0.38 -0.18
## polar.req freq
## MW.da NA NA
## vol NA NA
## bulk NA NA
## pol NA NA
## isoelec NA NA
## H2Opho.34 NA NA
## H2Opho.35 NA NA
## saaH2O NA NA
## faal.fold NA NA
## polar.req NA NA
## freq 0.14 NA
The code below returns the index of the max value in cor_. The variable that has the most positive correlation is surface area accessible and H2O. The variable with the most negative correlation is the polar requirement and H2Opho.35.
which(cor_ == max(cor_, na.rm = T), arr.ind = T)
## row col
## saaH2O 8 2
which(cor_ == min(cor_, na.rm = T), arr.ind = T)
## row col
## polar.req 10 7
the code below uses a function to create a panel for the scatter plot matrix that compares the variables in cor_. The correlation is outputted in the panel and changing the size of the text based on the correlation.
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor,...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}
Relationship between the 6 variables: The strongest correlations are between (MW.da and vol) The weakest correlations are between (vol and H2Opho.34). (MW.da and pol) show a non-linear relationship.
plot(aa_dat[c("MW.da", "vol", "pol", "H2Opho.34", "polar.req", "saaH2O", "faal.fold")], panel = panel.smooth, upper.panel = panel.cor, main = "Example")
This plot represents the amino acids in the data set comparing their isoelectic point vs. their volume.
plot(vol ~ isoelec, data = aa_dat,
xlab = "pI", ylab = "Volume(Angstrom)", main = "Amino Acids volume vs. pI",
col = 0)
text (vol ~ isoelec, labels = aa, data = aa_dat, col = 1:20)
ggscatter(y = "polar.req",
x = "freq",
size = "polar.req",
color = "polar.req",
data = aa_dat,
xlab = "pl",
ylab = "Volume (Angstroms)")
Creating a scatter plot with a non-linear relationship
ggscatter(y = "MW.da",
x = "polar.req",
size = "MW.da",
color = "bulk",
data = aa_dat,
xlab = "Polar Requirement",
ylab = "Molecular Weight (dalton)") + geom_smooth(method = "loess")
## `geom_smooth()` using formula 'y ~ x'
The log transformation allows non-linear data to be converted into somewhat linear data.
aa_dat$MW.da_log<- log(aa_dat$MW.da)
aa_dat$polar.req_log <- log(aa_dat$polar.req)
ggscatter(y = "MW.da_log",
x = "polar.req_log",
size = "MW.da",
color = "bulk",
add = "reg.line",
data = aa_dat,
xlab = "Polar Requirement",
ylab = "Molecular Weight (dalton)")
## `geom_smooth()` using formula 'y ~ x'
All three variables have a highly positive correlation. Molecular weight and surface area molecular weight and volume, and volume and surface area have a high positive correlation. The highest correlation is volume and surface area.
scatterplot3d(x = aa_dat$MW.da,
y = aa_dat$vol,
z = aa_dat$saaH2O,
highlight.3d = TRUE,
type = "h")
PCA allows us to manage large datasets by reducing the number of variables. This is especially useful to be able to visualize the relationship between fewer variables, allowing us to draw more accurate conclusions. “The principal components plot also shows that there is no particular similarity between amino acids in the same row” (Higgs and Attwood 2005, pg 4).
In this class, we created PCA plots in 2 ways: basic R PCA function and by using the vegan package. The basic R PCA function is not very flexible. With the vegan package, we used the rda() function to create the plot.
To create this data frame, I dropped the first two columns as well as columns 11-19 since those are the variables that are categorical.
aa_dat2 <- aa_dat[-c(1:2, 11:19)]
pca.out <- prcomp(aa_dat2, scale = TRUE)
biplot(pca.out)
scale = TRUE makes the PCA to consider correlations.
rda.out <- vegan :: rda(aa_dat2, scale = TRUE)
rda_scores <- scores(rda.out)
rda_scores
## $species
## PC1 PC2
## vol 0.1292455 -1.21867402
## bulk -0.5240422 -1.00021659
## pol 1.0392368 -0.22161142
## isoelec 0.4272346 -0.51780748
## H2Opho.34 -1.1576058 -0.06083202
## H2Opho.35 -1.1909235 0.07563303
## saaH2O 0.2293130 -1.17930581
## faal.fold -1.0664891 -0.35159129
##
## $sites
## PC1 PC2
## sit1 -0.431697708 1.01348876
## sit2 -0.812161280 0.50497652
## sit3 0.958626669 0.71357979
## sit4 0.941582653 0.22249037
## sit5 -0.867271594 -0.76051369
## sit6 -0.093468837 1.79557675
## sit7 0.643286377 -0.33344292
## sit8 -1.013877712 -0.56001005
## sit9 1.450808627 -0.67736662
## sit10 -0.904499352 -0.52173690
## sit11 -0.668686375 -0.38778285
## sit12 0.462264010 0.53066191
## sit13 0.007416261 0.36866189
## sit14 0.458970843 0.05206945
## sit15 1.580353509 -1.08451465
## sit16 -0.014704047 1.08414678
## sit17 -0.184996200 0.37158066
## sit18 -0.979015329 -0.23223164
## sit19 -0.420924604 -1.37484059
## sit20 -0.112005910 -0.72479295
##
## attr(,"const")
## [1] 3.511243
This plot created from the code below differs from Figure 2.10 in Higgs and Attwood because the points and scale are differnet as well the grouping. In my graph I chose to group the points by polarity. Additionally, my plot has 2 distinct clusters=, while Figure 2.10 has 4.
biplot(rda.out, display = "sites", col = 0,)
orditorp(rda.out, display = "sites", labels = aa_dat$aa, cex = 1.2)
ordihull(rda.out,
group = aa_dat$pol.cat,
col = 1:7,
lty = 1:7,
lwd = c(3,6))
Cluster analysis allows us to understand large data sets by grouping related points into clusters. This allows us to visualize different relationships between the data that may not be obvious just by looking at the data.
UPGMA uses a pair-wise distance matrix and compares the species, constructing a tree from the smallest distance between the groups of species. It produces a rooted tree.
Adding row names:
row.names(aa_dat2) <- paste(aa_dat$aa, sep = ".")
Euclidean Distance: a measurement of the distance between two points using the pythagorean theorem.
dist_euc <- dist(aa_dat2, method = "euclidean")
creating UPGMA
cluster_euc <- hclust(dist_euc, method = "average") # setting to UPGMA
plotting resulting dendrogram
plot(cluster_euc,hang = -1, cex = 0.5, labels = aa)
There are a few key differences between this dendrogram and Higgs and Attwood’s. Most of the clades are different, which could be due to a different clustering algorithm or could be due to different variables being used.