In this script we are recreating Higgs and Attwood’s analysis on the properties of amino acids. We are using different graphs and methods of data analysis, such as PCA and hierarchical cluster analysis, to analyze data of amino acids. The different graphs and methods of analyzing data allow us to see the relationships between amino acids and their properties.
ggpubr is a wrapper method for ggplot2. It allows for the creation of ggplot2-based graphs for people with non-advanced prgrammings backgrounds.
library(ggplot2) # makes nice graphs
library(ggpubr) # makes ggplot2 easier
library(vegan) # PCA has functions doing/plotting PCA
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.5-6
library(scatterplot3d) # for creating 3D scatterplots
These data are derived mostly from Higgs (2009) Table 1. They are similar to Higgs and Attwood 2005 but have extra columns. Nathan Brouwer added additional categorical variables based on online information.
Making the vectors
# 1 letter code
aa <-c('A','C','D','E','F','G','H','I','K','L','M','N',
'P','Q','R','S','T','V','W','Y')
## molecular weight in dalton
MW.da <-c(89,121,133,146,165,75,155,131,146,131,149,132,115,147,174,105,119,117,204,181)
## vol from van der Waals radii
vol <-c(67,86,91,109,135,48,118,124,135,124,124,96,90,
114,148,73,93,105,163,141)
## bulk – a measure of the shape of the side chain
bulk <-c(11.5,13.46,11.68,13.57,19.8,3.4,13.69,21.4,
15.71,21.4,16.25,12.28,17.43,
14.45,14.28,9.47,15.77,21.57,21.67,18.03)
## pol – a measure of the electric field strength around the molecule
pol <-c(0,1.48,49.7,49.9,0.35,0,51.6,0.13,49.5,0.13,
1.43,3.38,1.58,3.53,52,1.67,1.66,0.13,2.1,1.61)
## isoelec point
isoelec <-c(6,5.07,2.77,3.22,5.48,5.97,7.59,6.02,9.74,5.98,
5.74,5.41,6.3,5.65,10.76,5.68,6.16,5.96,5.89,5.66)
## 1st Hydrophobicity scale
H2Opho.34 <-c(1.8,2.5,-3.5,-3.5,2.8,-0.4,-3.2,4.5,-3.9,3.8,1.9,
-3.5,-1.6,-3.5,-4.5,-0.8,-0.7,4.2,-0.9,-1.3)
## 2nd Hydrophobicity scale
H2Opho.35 <-c(1.6,2,-9.2,-8.2,3.7,1,-3,3.1,-8.8,2.8,3.4,-4.8,
-0.2,-4.1,-12.3,0.6,1.2,2.6,1.9,-0.7)
## Surface area accessible to water in an unfolded peptide
saaH2O <-c(113,140,151,183,218,85,194,182,211,180,204,158,
143,189,241,122,146,160,259,229)
## Fraction of accessible area lost when a protein folds
faal.fold <-c(0.74,0.91,0.62,0.62,0.88,0.72,0.78,0.88,0.52,
0.85,0.85,0.63,0.64,0.62,0.64,0.66,0.7,0.86,0.85,0.76)
# Polar requirement
polar.req <-c(7,4.8,13,12.5,5,7.9,8.4,4.9,10.1,4.9,5.3,10,
6.6,8.6,9.1,7.5,6.6,5.6,5.2,5.4)
## relative frequency of occurance
## "The frequencies column shows the mean
### percentage of each amino acid in the protein sequences
### of modern organisms"
freq <-c(7.8,1.1,5.19,6.72,4.39,6.77,2.03,6.95,6.32,
10.15,2.28,4.37,4.26,3.45,5.23,6.46,5.12,7.01,1.09,3.3)
##
# charges
## un = Un-charged
## neg = negative
## pos = positive
charge<-c('un','un','neg','neg','un','un','pos','un','pos',
'un','un','un','un','un','pos','un','un','un','un','un')
## hydropathy
hydropathy<-c('hydrophobic','hydrophobic','hydrophilic','
hydrophilic','hydrophobic','neutral','neutral',
'hydrophobic','hydrophilic','hydrophobic','hydrophobic',
'hydrophilic','neutral',
'hydrophilic','hydrophilic','neutral','neutral',
'hydrophobic','hydrophobic','neutral')
## vol
vol.cat<-c('verysmall','small','small','medium',
'verylarge','verysmall','medium','large','large',
'large','large','small','small','medium','large',
'verysmall',
'small','medium','verylarge','verylarge')
## pol
pol.cat<-c('nonpolar','nonpolar','polar','polar',
'nonpolar','nonpolar','polar','nonpolar',
'polar','nonpolar','nonpolar','polar','nonpolar','polar',
'polar','polar','polar','nonpolar','nonpolar','polar')
## chemical
chemical<-c('aliphatic','sulfur','acidic','acidic','aromatic',
'aliphatic','basic','aliphatic','basic','aliphatic','sulfur',
'amide','aliphatic','amide','basic','hydroxyl','hydroxyl',
'aliphatic','aromatic','aromatic')
This makes a dataframe with all of these vectors as columns. The vectors were created above
### Build dataframe
aa_dat <- data.frame(aa,
MW.da,vol,
bulk, pol,
isoelec,H2Opho.34, H2Opho.35,
saaH2O, faal.fold, polar.req,
freq, charge, hydropathy,
vol.cat, pol.cat, chemical)
aa_dat
## aa MW.da vol bulk pol isoelec H2Opho.34 H2Opho.35 saaH2O faal.fold
## 1 A 89 67 11.50 0.00 6.00 1.8 1.6 113 0.74
## 2 C 121 86 13.46 1.48 5.07 2.5 2.0 140 0.91
## 3 D 133 91 11.68 49.70 2.77 -3.5 -9.2 151 0.62
## 4 E 146 109 13.57 49.90 3.22 -3.5 -8.2 183 0.62
## 5 F 165 135 19.80 0.35 5.48 2.8 3.7 218 0.88
## 6 G 75 48 3.40 0.00 5.97 -0.4 1.0 85 0.72
## 7 H 155 118 13.69 51.60 7.59 -3.2 -3.0 194 0.78
## 8 I 131 124 21.40 0.13 6.02 4.5 3.1 182 0.88
## 9 K 146 135 15.71 49.50 9.74 -3.9 -8.8 211 0.52
## 10 L 131 124 21.40 0.13 5.98 3.8 2.8 180 0.85
## 11 M 149 124 16.25 1.43 5.74 1.9 3.4 204 0.85
## 12 N 132 96 12.28 3.38 5.41 -3.5 -4.8 158 0.63
## 13 P 115 90 17.43 1.58 6.30 -1.6 -0.2 143 0.64
## 14 Q 147 114 14.45 3.53 5.65 -3.5 -4.1 189 0.62
## 15 R 174 148 14.28 52.00 10.76 -4.5 -12.3 241 0.64
## 16 S 105 73 9.47 1.67 5.68 -0.8 0.6 122 0.66
## 17 T 119 93 15.77 1.66 6.16 -0.7 1.2 146 0.70
## 18 V 117 105 21.57 0.13 5.96 4.2 2.6 160 0.86
## 19 W 204 163 21.67 2.10 5.89 -0.9 1.9 259 0.85
## 20 Y 181 141 18.03 1.61 5.66 -1.3 -0.7 229 0.76
## polar.req freq charge hydropathy vol.cat pol.cat
## 1 7.0 7.80 un hydrophobic verysmall nonpolar
## 2 4.8 1.10 un hydrophobic small nonpolar
## 3 13.0 5.19 neg hydrophilic small polar
## 4 12.5 6.72 neg \n hydrophilic medium polar
## 5 5.0 4.39 un hydrophobic verylarge nonpolar
## 6 7.9 6.77 un neutral verysmall nonpolar
## 7 8.4 2.03 pos neutral medium polar
## 8 4.9 6.95 un hydrophobic large nonpolar
## 9 10.1 6.32 pos hydrophilic large polar
## 10 4.9 10.15 un hydrophobic large nonpolar
## 11 5.3 2.28 un hydrophobic large nonpolar
## 12 10.0 4.37 un hydrophilic small polar
## 13 6.6 4.26 un neutral small nonpolar
## 14 8.6 3.45 un hydrophilic medium polar
## 15 9.1 5.23 pos hydrophilic large polar
## 16 7.5 6.46 un neutral verysmall polar
## 17 6.6 5.12 un neutral small polar
## 18 5.6 7.01 un hydrophobic medium nonpolar
## 19 5.2 1.09 un hydrophobic verylarge nonpolar
## 20 5.4 3.30 un neutral verylarge polar
## chemical
## 1 aliphatic
## 2 sulfur
## 3 acidic
## 4 acidic
## 5 aromatic
## 6 aliphatic
## 7 basic
## 8 aliphatic
## 9 basic
## 10 aliphatic
## 11 sulfur
## 12 amide
## 13 aliphatic
## 14 amide
## 15 basic
## 16 hydroxyl
## 17 hydroxyl
## 18 aliphatic
## 19 aromatic
## 20 aromatic
This creates a symmetric matrix and puts coefficients in the dataframe. It also puts NA on the diagonal and for repeating data
cor_ <- round(cor(aa_dat[,-c(1,13:17)]),2)
diag(cor_) <- NA
cor_[upper.tri(cor_)] <- NA
cor_
## MW.da vol bulk pol isoelec H2Opho.34 H2Opho.35 saaH2O faal.fold
## MW.da NA NA NA NA NA NA NA NA NA
## vol 0.93 NA NA NA NA NA NA NA NA
## bulk 0.55 0.73 NA NA NA NA NA NA NA
## pol 0.29 0.24 -0.20 NA NA NA NA NA NA
## isoelec 0.20 0.36 0.08 0.27 NA NA NA NA NA
## H2Opho.34 -0.27 -0.08 0.44 -0.67 -0.20 NA NA NA NA
## H2Opho.35 -0.25 -0.16 0.32 -0.85 -0.26 0.85 NA NA NA
## saaH2O 0.97 0.99 0.64 0.29 0.35 -0.18 -0.23 NA NA
## faal.fold 0.11 0.18 0.49 -0.53 -0.18 0.84 0.79 0.12 NA
## polar.req -0.05 -0.19 -0.53 0.76 -0.11 -0.79 -0.87 -0.11 -0.81
## freq -0.52 -0.30 -0.04 -0.01 0.02 0.26 -0.02 -0.38 -0.18
## polar.req freq
## MW.da NA NA
## vol NA NA
## bulk NA NA
## pol NA NA
## isoelec NA NA
## H2Opho.34 NA NA
## H2Opho.35 NA NA
## saaH2O NA NA
## faal.fold NA NA
## polar.req NA NA
## freq 0.14 NA
The saaH2O(Surface area accessible to water in an unfolded peptide) and vol (volume from van der Waals radii) have the most positive correlation of 0.99. The pol.req (polar requirement) and H2Opho.35 (2nd Hydrophobicity scale) have the most negative correlation with -0.87.
which(cor_ == max(cor_, na.rm = T), arr.ind = T)
## row col
## saaH2O 8 2
which(cor_ == min(cor_, na.rm = T), arr.ind = T)
## row col
## polar.req 10 7
This function plots the correlation coefficients etween variables. In the scatterplot matrix below, upper.panel = panel.cor and the upper half of the matrix is just the correlation coefficients instead of the same graphs again.
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor,...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}
In addition to the strongest correlation written about earlier, Molecular weight in daltons (MW.da) and volume have a very storng correlation along with MW.da and saaH2O. Volume and the 1st Hydrophobicity scale(H2O.34) have a very weak correlation. Same with MW.da and polar.req, there is a very weak correlation. H2O.34 and pol have a very nonlinear relationship in addition to polar.req and faal.fold(Fraction of accessible area lost when a protein folds)
#make scatterplot matrix
plot(aa_dat[ ,c("MW.da", "vol", "pol", "H2Opho.34", "polar.req", "saaH2O", "faal.fold")],
panel = panel.smooth,
upper.panel = panel.cor )
Plot of the amino acid polar requirement versus the percentage of each amino acid in the protein sequences
#plot() creates blank plot
plot(polar.req ~ freq, data = aa_dat,
xlab = "Polar Requirement", #adding x axis label
ylab = "Frequency", #adding y axis label
main = "Relative Frequency of Occurance vs Polar Requirement",
col = 0)
#text() uses same tilda notation as plot() to plot the labels
text(polar.req ~ freq,
labels = aa,
data = aa_dat,
col = 1:20)
ggscatter(y = "polar.req",
x = "freq",
shape = "chemical",
color = "polar.req",
data = aa_dat,
xlab = "Frequency",
ylab = "Polar Requirement")
## Warning: The shape palette can deal with a maximum of 6 discrete values because
## more than 6 becomes difficult to discriminate; you have 7. Consider
## specifying shapes manually if you must have them.
## Warning: Removed 2 rows containing missing values (geom_point).
### Figure 3: Highly Correlated Data The general mathematical form of the regression line is y = mx + b. where y is the dependent variable and x is the explanatory variable. The slope of the line is b and m is the y intercept.
A data ellipse is similar to a line of regression in that it tries to situate the data and show a general trend. A data ellipse superimposes the normal-probability contours over a scatterplot of the data
ggscatter(y = "polar.req",
x = "freq",
size = "polar.req",
add = "reg.line", # line of best fit
ellipse = TRUE, # data ellipse
cor.coef = TRUE, # correlation coefficient
data = aa_dat,
xlab = "Frequency",
ylab = "Polar Requirement")
## `geom_smooth()` using formula 'y ~ x'
?ellipse
ggscatter(y = "pol",
x = "H2Opho.34",
add = "loess",
ylab = "Pol",
xlab = "1st Hydrophobicity scale",
color = "freq",
size = "faal.fold",
data = aa_dat)
## `geom_smooth()` using formula 'y ~ x'
Taking the log transformation both of the variables can be used to re-scale variables or make the relationships linear
aa_dat$pol_log <- log(aa_dat$pol)
aa_dat$H2Opho.34_log <- log(aa_dat$H2Opho.34)
## Warning in log(aa_dat$H2Opho.34): NaNs produced
Plotting the same scatterplot as Figure 4 but with the log of the variables
ggscatter(y = "pol_log",
x = "H2Opho.34_log",
add = "reg.line",
ylab = "Logarithmic Pol",
xlab = "Log of 1st Hydrophobicity scale",
size = "faal.fold",
color = "freq",
data = aa_dat)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 14 rows containing non-finite values (stat_smooth).
## Warning: Removed 13 rows containing missing values (geom_point).
All three of these variables have very high correlations with each other, all above 0.90, therefore I decided to plot all of them together.
scatterplot3d(x = aa_dat$MW.da,
y = aa_dat$saaH2O,
z = aa_dat$vol,
highlight.3d = TRUE,
type = "h")
PCA is a dimensionality-reduction method that is used on larger data sets to make it easier to work with yet keeping most of the information. The number of variables are reduced to make the data easier to visualize and analyze. Higgs and Attwood describe it as follows, “Each row in the data matrix can be thought of as the coordinates of a point in P-dimensional space. The whole data set is a cloud of these points. The PCA method transforms this cloud of points first by scaling them and shifting them to the origin, and then by rotating them in such a way that the points are spread out as much as possible, and the structure in the data is made easier to see.” (Higgs and Attwood 2005, pg 26) In class we have used PCA for phylogenetic trees.
I used column names to create a subset of the previous dataframe aa_dat.
aa_dat2 <- aa_dat[,c("vol", "bulk", "pol", "H2Opho.34", "isoelec", "H2Opho.35", "saaH2O", "faal.fold")]
Running a PCA and creating a PCA plot using base R command
pca.out <- prcomp(na.omit(aa_dat2), scale = TRUE)
Plotting the output
biplot(pca.out)
Scale centers and/or scales the columns of a numeric matrix
rda.out <- vegan::rda(aa_dat2, scale = TRUE)
Extract PCA scores
rda_scores <- scores(rda.out)
There are some differences between this plot and the Higgs and Attwood plot on page 28. A main difference being that there is a very different sizes in the clusters in the plot below. There are three clusters, however one is significantly larger and the other two hold only two or three points. The largest cluster also spans all four quadranst while the other two smaller clusters stick to one quadrant.
Making a biplot, or PCA plot.
biplot(rda.out, display = "sites", col = 0,)
orditorp(rda.out, display = "sites", labels = aa_dat$aa, cex = 1.2)
ordihull(rda.out,
group = aa_dat$charge,
col = 1:7,
lty = 1:7,
lwd = c(3,6))
The purpose of cluster analysis is to see trends in the data and see which points of data are close to each other. The data is placed into groups, or clusters, for statistical data analysis.
UPGMA is a simple hierarchical clustering method. After a distance matrix has been created, find the two species with the least amount of differences. An average distance is caluclated and the two species become a cluster, or a unit. Then the process is continued with each cluster acting as one unit and being compared to other species to find a new cluster.
Creating row names
row.names(aa_dat2) <- paste(aa_dat$aa,
sep = ".")
A distance matrix is a matrix containing the distances between elements of a set. These distances are taken pairwise meaning the distance between each pair of the species. Euclidean distance is the straight line distance between two points on a plane.
Creating a distance matrix; calculating distance using euclidean method
dist_euc <- dist(aa_dat2,
method = "euclidean")
Using the hclust() function to perform cluster analysis
clust_euc <- hclust(dist_euc, method = "average")
There are a few differences between my graph below and the one in Higgs and Attwood 2005. One main one I am noticing is there are about 6 clusters in the paper and there are also about 6 in my graph, they are just structured differently.
dendro_euc <- as.dendrogram(clust_euc)
plot(dendro_euc, horiz = T,
nodePar = list(pch = c(1,NA),
cex = 0.5,
lab.cex = 1.2))