In this script we want to replicate the analysis on the properties of amino acids done by Higgs and Attwood.By building data frames from multiple variables in relation to amino acids, we can identify relationships and possible correlations. In this script we make different types of plots and diagrams to visualize the patterns among the amino acids and their properties. We care about this biologically because by understanding the relationships between the properties of amino acids, we can better understand and predict their actions and functions.
ggpubr creates wrappers for ggplot2’s core functionality.
library(permute) ## Loading required package: permute
library(lattice) ## Loading required package: lattice
library(vegan) ## This is vegan 2.5-6
## This is vegan 2.5-6
These data are derived mostly from Higgs (2009) Table 1. They are similar to Higgs and Attwood 2005 but have extra columns. Nathan Brouwer added additional categorical varibles based on online information. The data below represents properties of 20 amino acids.
# 1 letter code
aa <-c('A','C','D','E','F','G','H','I','K','L','M','N', 'P','Q','R','S','T','V','W','Y')
## molecular weight in dalton
MW.da <-c(89,121,133,146,165,75,155,131,146,131,149,132,115,147,174,105,119,117,204,181)
## vol from van der Waals radii
vol <-c(67,86,91,109,135,48,118,124,135,124,124,96,90, 114,148,73,93,105,163,141)
## bulk – a measure of the shape of the side chain
bulk <-c(11.5,13.46,11.68,13.57,19.8,3.4,13.69,21.4, 15.71,21.4,16.25,12.28,17.43,
14.45,14.28,9.47,15.77,21.57,21.67,18.03)
## pol – a measure of the electric field strength around the molecule
pol <-c(0,1.48,49.7,49.9,0.35,0,51.6,0.13,49.5,0.13, 1.43,3.38,1.58,3.53,52,1.67,1.66,0.13,2.1,1.61)
## isoelec point
isoelec <-c(6,5.07,2.77,3.22,5.48,5.97,7.59,6.02,9.74,5.98, 5.74,5.41,6.3,5.65,10.76,5.68,6.16,5.96,5.89,5.66)
## 1st Hydrophobicity scale
H2Opho.34 <-c(1.8,2.5,-3.5,-3.5,2.8,-0.4,-3.2,4.5,-3.9,3.8,1.9, -3.5,-1.6,-3.5,-4.5,-0.8,-0.7,4.2,-0.9,-1.3)
## 2nd Hydrophobicity scale
H2Opho.35 <-c(1.6,2,-9.2,-8.2,3.7,1,-3,3.1,-8.8,2.8,3.4,-4.8, -0.2,-4.1,-12.3,0.6,1.2,2.6,1.9,-0.7)
## Surface area accessible to water in an unfolded peptide
saaH2O <-c(113,140,151,183,218,85,194,182,211,180,204,158, 143,189,241,122,146,160,259,229)
## Fraction of accessible area lost when a protein folds
faal.fold <-c(0.74,0.91,0.62,0.62,0.88,0.72,0.78,0.88,0.52, 0.85,0.85,0.63,0.64,0.62,0.64,0.66,0.7,0.86,0.85,0.76)
# Polar requirement
polar.req <-c(7,4.8,13,12.5,5,7.9,8.4,4.9,10.1,4.9,5.3,10, 6.6,8.6,9.1,7.5,6.6,5.6,5.2,5.4)
## relative frequency of occurance
## "The frequencies column shows the mean
### percentage of each amino acid in the protein sequences ### of modern organisms"
freq <-c(7.8,1.1,5.19,6.72,4.39,6.77,2.03,6.95,6.32,
10.15,2.28,4.37,4.26,3.45,5.23,6.46,5.12,7.01,1.09,3.3)
##
# charges
## un = Un-charged
## neg = negative
## pos = positive
charge<-c('un','un','neg','neg','un','un','pos','un','pos','un','un','un','un','un','pos','un','un','un','un','un')
## hydropathy
hydropathy<-c('hydrophobic','hydrophobic','hydrophilic',' hydrophilic','hydrophobic','neutral','neutral',
'hydrophobic','hydrophilic','hydrophobic','hydrophobic',
'hydrophilic','neutral',
'hydrophilic','hydrophilic','neutral','neutral',
'hydrophobic','hydrophobic','neutral')
## vol
vol.cat<-c('verysmall','small','small','medium', 'verylarge','verysmall','medium','large','large',
'large','large','small','small','medium','large',
'verysmall',
'small','medium','verylarge','verylarge')
## pol
pol.cat<-c('nonpolar','nonpolar','polar','polar', 'nonpolar','nonpolar','polar','nonpolar',
'polar','nonpolar','nonpolar','polar','nonpolar','polar',
'polar','polar','polar','nonpolar','nonpolar','polar')
## chemical
chemical<-c('aliphatic','sulfur','acidic','acidic','aromatic', 'aliphatic','basic','aliphatic','basic','aliphatic','sulfur', 'amide','aliphatic','amide','basic','hydroxyl','hydroxyl', 'aliphatic','aromatic','aromatic')
Build the Dataframes
aa_dat <- data.frame(aa,MW.da, vol, bulk, pol, isoelec, H2Opho.34, H2Opho.35, saaH2O, faal.fold, polar.req, freq, charge, hydropathy, vol.cat, pol.cat, chemical)
aa_dat
## aa MW.da vol bulk pol isoelec H2Opho.34 H2Opho.35 saaH2O faal.fold
## 1 A 89 67 11.50 0.00 6.00 1.8 1.6 113 0.74
## 2 C 121 86 13.46 1.48 5.07 2.5 2.0 140 0.91
## 3 D 133 91 11.68 49.70 2.77 -3.5 -9.2 151 0.62
## 4 E 146 109 13.57 49.90 3.22 -3.5 -8.2 183 0.62
## 5 F 165 135 19.80 0.35 5.48 2.8 3.7 218 0.88
## 6 G 75 48 3.40 0.00 5.97 -0.4 1.0 85 0.72
## 7 H 155 118 13.69 51.60 7.59 -3.2 -3.0 194 0.78
## 8 I 131 124 21.40 0.13 6.02 4.5 3.1 182 0.88
## 9 K 146 135 15.71 49.50 9.74 -3.9 -8.8 211 0.52
## 10 L 131 124 21.40 0.13 5.98 3.8 2.8 180 0.85
## 11 M 149 124 16.25 1.43 5.74 1.9 3.4 204 0.85
## 12 N 132 96 12.28 3.38 5.41 -3.5 -4.8 158 0.63
## 13 P 115 90 17.43 1.58 6.30 -1.6 -0.2 143 0.64
## 14 Q 147 114 14.45 3.53 5.65 -3.5 -4.1 189 0.62
## 15 R 174 148 14.28 52.00 10.76 -4.5 -12.3 241 0.64
## 16 S 105 73 9.47 1.67 5.68 -0.8 0.6 122 0.66
## 17 T 119 93 15.77 1.66 6.16 -0.7 1.2 146 0.70
## 18 V 117 105 21.57 0.13 5.96 4.2 2.6 160 0.86
## 19 W 204 163 21.67 2.10 5.89 -0.9 1.9 259 0.85
## 20 Y 181 141 18.03 1.61 5.66 -1.3 -0.7 229 0.76
## polar.req freq charge hydropathy vol.cat pol.cat chemical
## 1 7.0 7.80 un hydrophobic verysmall nonpolar aliphatic
## 2 4.8 1.10 un hydrophobic small nonpolar sulfur
## 3 13.0 5.19 neg hydrophilic small polar acidic
## 4 12.5 6.72 neg hydrophilic medium polar acidic
## 5 5.0 4.39 un hydrophobic verylarge nonpolar aromatic
## 6 7.9 6.77 un neutral verysmall nonpolar aliphatic
## 7 8.4 2.03 pos neutral medium polar basic
## 8 4.9 6.95 un hydrophobic large nonpolar aliphatic
## 9 10.1 6.32 pos hydrophilic large polar basic
## 10 4.9 10.15 un hydrophobic large nonpolar aliphatic
## 11 5.3 2.28 un hydrophobic large nonpolar sulfur
## 12 10.0 4.37 un hydrophilic small polar amide
## 13 6.6 4.26 un neutral small nonpolar aliphatic
## 14 8.6 3.45 un hydrophilic medium polar amide
## 15 9.1 5.23 pos hydrophilic large polar basic
## 16 7.5 6.46 un neutral verysmall polar hydroxyl
## 17 6.6 5.12 un neutral small polar hydroxyl
## 18 5.6 7.01 un hydrophobic medium nonpolar aliphatic
## 19 5.2 1.09 un hydrophobic verylarge nonpolar aromatic
## 20 5.4 3.30 un neutral verylarge polar aromatic
This code makes a correlation matrix.
cor_ <- round(cor(aa_dat[,-c(1,13:17)]),2)
diag(cor_) <- NA
cor_[upper.tri(cor_)] <- NA
cor_
## MW.da vol bulk pol isoelec H2Opho.34 H2Opho.35 saaH2O faal.fold
## MW.da NA NA NA NA NA NA NA NA NA
## vol 0.93 NA NA NA NA NA NA NA NA
## bulk 0.55 0.73 NA NA NA NA NA NA NA
## pol 0.29 0.24 -0.20 NA NA NA NA NA NA
## isoelec 0.20 0.36 0.08 0.27 NA NA NA NA NA
## H2Opho.34 -0.27 -0.08 0.44 -0.67 -0.20 NA NA NA NA
## H2Opho.35 -0.25 -0.16 0.32 -0.85 -0.26 0.85 NA NA NA
## saaH2O 0.97 0.99 0.64 0.29 0.35 -0.18 -0.23 NA NA
## faal.fold 0.11 0.18 0.49 -0.53 -0.18 0.84 0.79 0.12 NA
## polar.req -0.05 -0.19 -0.53 0.76 -0.11 -0.79 -0.87 -0.11 -0.81
## freq -0.52 -0.30 -0.04 -0.01 0.02 0.26 -0.02 -0.38 -0.18
## polar.req freq
## MW.da NA NA
## vol NA NA
## bulk NA NA
## pol NA NA
## isoelec NA NA
## H2Opho.34 NA NA
## H2Opho.35 NA NA
## saaH2O NA NA
## faal.fold NA NA
## polar.req NA NA
## freq 0.14 NA
Locating the points of maximum correlation.
which(cor_ == max(cor_, na.rm = T), arr.ind = T)
## row col
## saaH2O 8 2
cor_[which(cor_ == max(cor_, na.rm = T), arr.ind = T)]
## [1] 0.99
The maximum correlation (strongest positive) is between saaH2O (row 8) and volume (column 2) at 0.99.
Locating the points of minimum correlation.
which(cor_ == min(cor_, na.rm = T), arr.ind = T)
## row col
## polar.req 10 7
cor_[which(cor_ == min(cor_, na.rm = T), arr.ind = T)]
## [1] -0.87
The minimum correlation (strongest negative) is between polar requirement (row 10) and the 2nd hyrophobicity scale (column 7) at -0.87.
The code below formats the scatterplot matrix by taking variables from the data frame.
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor,...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}
This code plots the scatterplot matrix.
aa_matrix <- data.frame(MW.da, vol, pol, H2Opho.34, polar.req, saaH2O, faal.fold)
plot(aa_matrix,upper.panel = panel.cor,
panel = panel.smooth,
main = "Plot 0: Scatterplot matrix")
We can observe the relationship between 6 variables in this scatterplot matrix. It shows correlation values and if these values are 0.7 or above there is a strong positive correlation between the variables and a strong negative correlation if the value is -0.7 or lower.
Replicating the plot from Figure 2.8 of Higgs and Attwood showing the relationship between volume and pI in amino acids.
#main plotting
plot(vol ~ isoelec,
data = aa_dat,
xlab = "pI",
ylab = "Volume(Å3)",
main = "Plot 1: Amino acid vol vs. pI",
col = 0)
#adding text labels
text(vol ~ isoelec,
labels = aa,
data = aa_dat,
col = 1:20)
The plot assists in visualizing the relationship between volume and isoeletric point in 20 amino acids.
Making a plot that compares reelationship between polar requirement and frequency
ggpubr::ggscatter(y = "polar.req",
x = "freq",
size = "MW.da",
color = "polar.req",
data = aa_dat,
xlab = "polar requirement",
ylab = "frequency")
ggpubr::ggscatter(x = "saaH2O",
y = "polar.req",
data = aa_dat,
add ="loess",
size = "pol",
color = "faal.fold",
xlab = "Surface area accessible to water in an unfolded peptide",
ylab = "Polar Requirement")
## `geom_smooth()` using formula 'y ~ x'
aa_dat$saaH2O_log <- log(aa_dat$saaH2O)
aa_dat$polar.req_log <- log(aa_dat$polar.req)
ggpubr::ggscatter(x = "saaH2O_log",
y = "polar.req_log",
data = aa_dat,
add = "reg.line",
size = "pol",
color = "faal.fold",
xlab = "Log Surface area accessible to water in an unfolded peptide",
ylab = "Log polar requirement")
## `geom_smooth()` using formula 'y ~ x'
The log transformation rescales the two variables from a non-linear relationship in order to make them linear.
library(scatterplot3d)
aa_3d <- scatterplot3d(x = aa_dat$H2Opho.34,
y = aa_dat$faal.fold,
z = aa_dat$polar.req,
xlab = "H2Opho.34",
ylab = "faal.fold",
zlab = "polar.req",
highlight.3d = TRUE,
type = "h")
The correlation between the fraction of accessible area lost when a protein folds and the 1st hydrophobicity is positive. The correlation between polar requirement and the fraction of accessible area lost when a protein folds is negative. The correlation between polar requirement and 1st hydrophobicity is also negative.
PCA (principal component analysis) is a statistical technique that is used to reduce the dimensional data of data sets. This allows for easier examination and interpretation of the variances of many variables. Higgs and Atwood describe PCA as having the ability that “shows that there is no particular similarity between amino acids in the same row” (Higgs and Attwood 2005, pg 4). In this class, we used two basic ways we make PCA plots. The two ways are through vegan and base R. The number of objects in a data set and the number of properties evaluated are used in creating the plots.
A new data set was created based off Table 2.2. The categorical variables were dropped.
aa_dat2 <- aa_dat[c("vol","bulk","pol","isoelec","H2Opho.34","H2Opho.35","saaH2O","faal.fold")]
row.names(aa_dat2) <- aa_dat$aa
pca.out <- prcomp(aa_dat2, scale = TRUE)
biplot(pca.out)
scale = TRUE allows for variable scaling using standard devation.
rda.out <- vegan::rda(aa_dat2, scale = TRUE)
rda_scores <- scores(rda.out)
biplot(rda.out, display ="sites", col = 0)
orditorp(rda.out, display = "sites", labels = aa_dat$aa, cex = 1.2)
ordihull(rda.out,
group = aa_dat$hydropathy,
col = 1:7,
lty = 1:7,
lwd = c(3,6))
The plot above and the one from Figure 2.10 are a bit different. The variables are different with different numerical ranges. The orientation of the plots seem completely opposite, almost like it flipped.
The purpose of cluster analysis is to place objects into groups, or clusters, suggested by the data. This allows for the identification of patterns in the data set. UPGMA is a hierarchical cluster analysis method. A distance matrix is first made from data. The the Euclidean distance are also taken into consideration among the amino acids. Clusters are made by combining the most similar amino acids and then combining the nearest clusters to continue making bigger clusters.
A distance matrix shows the difference between pairs of objects. For this analysis we are looking at differences between amino acids in different data categories. The Euclidean distance is the distance between two points in a Euclidean space. The numerical value of the points is the Euclidean distance.
dist_euc <- dist(aa_dat2,
method = "euclidean")
clust_euc <- hclust(dist_euc,
method = "average") # setting to use UPGMA
dendro_euc <- as.dendrogram(clust_euc)
plot(dendro_euc, horiz = TRUE,
nodePar = list(pch = c(1,NA),
cex = 0.5,
lab.cex = 0.5))
There are a few difference between the cluster diagram above and the one produced by Higgs and Atwood. Some of the amino acids clustered together in the Higgs and Atwood diagram are not closely related in the dendrogram. Most noticeably, A and S are not on the same cluster in Higgs and Atwood but are closely related in the above dendrogram.