load the following packages. ggplot2 is used to create advanced graphs to visualize data, and ggpubr makes ggplot2 more accessible to less experienced coders.
library(ggplot2) # makes plots
library(ggpubr) # makes ggplot2 easier
library(vegan) # PCA has functions doing/plotting PCA
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.5-7
library(scatterplot3d) # graphing 3 variables
library(permute) # aids with spatial grid designs
library(lattice) # advanced data visualization
These data are derived mostly from Higgs (2009) Table 1. They are similar to Higgs and Attwood 2005 but have extra columns. Nathan Brouwer added additional categorical varibles based on online information. These include information about the 20 amino acids including size (weight and volume), polarity, frequency, etc..
Make the vectors.
# 1 letter code
aa <-c('A','C','D','E','F','G','H','I','K','L','M','N',
'P','Q','R','S','T','V','W','Y')
## molecular weight in dalton
MW.da <-c(89,121,133,146,165,75,155,131,146,131,149,132,115,147,174,105,119,117,204,181)
## vol from van der Waals radii
vol <-c(67,86,91,109,135,48,118,124,135,124,124,96,90,
114,148,73,93,105,163,141)
## bulk – a measure of the shape of the side chain
bulk <-c(11.5,13.46,11.68,13.57,19.8,3.4,13.69,21.4,
15.71,21.4,16.25,12.28,17.43,
14.45,14.28,9.47,15.77,21.57,21.67,18.03)
## pol – a measure of the electric field strength around the molecule
pol <-c(0,1.48,49.7,49.9,0.35,0,51.6,0.13,49.5,0.13,
1.43,3.38,1.58,3.53,52,1.67,1.66,0.13,2.1,1.61)
## isoelec point
isoelec <-c(6,5.07,2.77,3.22,5.48,5.97,7.59,6.02,9.74,5.98,
5.74,5.41,6.3,5.65,10.76,5.68,6.16,5.96,5.89,5.66)
## 1st Hydrophobicity scale
H2Opho.34 <-c(1.8,2.5,-3.5,-3.5,2.8,-0.4,-3.2,4.5,-3.9,3.8,1.9,
-3.5,-1.6,-3.5,-4.5,-0.8,-0.7,4.2,-0.9,-1.3)
## 2nd Hydrophobicity scale
H2Opho.35 <-c(1.6,2,-9.2,-8.2,3.7,1,-3,3.1,-8.8,2.8,3.4,-4.8,
-0.2,-4.1,-12.3,0.6,1.2,2.6,1.9,-0.7)
## Surface area accessible to water in an unfolded peptide
saaH2O <-c(113,140,151,183,218,85,194,182,211,180,204,158,
143,189,241,122,146,160,259,229)
## Fraction of accessible area lost when a protein folds
faal.fold <-c(0.74,0.91,0.62,0.62,0.88,0.72,0.78,0.88,0.52,
0.85,0.85,0.63,0.64,0.62,0.64,0.66,0.7,0.86,0.85,0.76)
# Polar requirement
polar.req <-c(7,4.8,13,12.5,5,7.9,8.4,4.9,10.1,4.9,5.3,10,
6.6,8.6,9.1,7.5,6.6,5.6,5.2,5.4)
## relative frequency of occurance
## "The frequencies column shows the mean
### percentage of each amino acid in the protein sequences
### of modern organisms"
freq <-c(7.8,1.1,5.19,6.72,4.39,6.77,2.03,6.95,6.32,
10.15,2.28,4.37,4.26,3.45,5.23,6.46,5.12,7.01,1.09,3.3)
##
# charges
## un = Un-charged
## neg = negative
## pos = positive
charge<-c('un','un','neg','neg','un','un','pos','un','pos',
'un','un','un','un','un','pos','un','un','un','un','un')
## hydropathy
hydropathy<-c('hydrophobic','hydrophobic','hydrophilic','
hydrophilic','hydrophobic','neutral','neutral',
'hydrophobic','hydrophilic','hydrophobic','hydrophobic',
'hydrophilic','neutral',
'hydrophilic','hydrophilic','neutral','neutral',
'hydrophobic','hydrophobic','neutral')
## vol
vol.cat<-c('verysmall','small','small','medium',
'verylarge','verysmall','medium','large','large',
'large','large','small','small','medium','large',
'verysmall',
'small','medium','verylarge','verylarge')
## pol
pol.cat<-c('nonpolar','nonpolar','polar','polar',
'nonpolar','nonpolar','polar','nonpolar',
'polar','nonpolar','nonpolar','polar','nonpolar','polar',
'polar','polar','polar','nonpolar','nonpolar','polar')
## chemical
chemical<-c('aliphatic','sulfur','acidic','acidic','aromatic',
'aliphatic','basic','aliphatic','basic','aliphatic','sulfur',
'amide','aliphatic','amide','basic','hydroxyl','hydroxyl',
'aliphatic','aromatic','aromatic')
Build the dataframes.
### Build dataframe
aa_dat <- data.frame(aa,
MW.da,vol,
bulk, pol,
isoelec,H2Opho.34, H2Opho.35,
saaH2O, faal.fold, polar.req,
freq, charge, hydropathy,
vol.cat, pol.cat, chemical)
The code below creates a correlation matrix, which displays the correlation values between two indicated variables. Values closer to zero indicate lower correlation, while absolute values closer to 1 indicate high correlation.
cor_ <- round(cor(aa_dat[,-c(1,13:17)]),2)
diag(cor_) <- NA
cor_[upper.tri(cor_)] <- NA
cor_
## MW.da vol bulk pol isoelec H2Opho.34 H2Opho.35 saaH2O faal.fold
## MW.da NA NA NA NA NA NA NA NA NA
## vol 0.93 NA NA NA NA NA NA NA NA
## bulk 0.55 0.73 NA NA NA NA NA NA NA
## pol 0.29 0.24 -0.20 NA NA NA NA NA NA
## isoelec 0.20 0.36 0.08 0.27 NA NA NA NA NA
## H2Opho.34 -0.27 -0.08 0.44 -0.67 -0.20 NA NA NA NA
## H2Opho.35 -0.25 -0.16 0.32 -0.85 -0.26 0.85 NA NA NA
## saaH2O 0.97 0.99 0.64 0.29 0.35 -0.18 -0.23 NA NA
## faal.fold 0.11 0.18 0.49 -0.53 -0.18 0.84 0.79 0.12 NA
## polar.req -0.05 -0.19 -0.53 0.76 -0.11 -0.79 -0.87 -0.11 -0.81
## freq -0.52 -0.30 -0.04 -0.01 0.02 0.26 -0.02 -0.38 -0.18
## polar.req freq
## MW.da NA NA
## vol NA NA
## bulk NA NA
## pol NA NA
## isoelec NA NA
## H2Opho.34 NA NA
## H2Opho.35 NA NA
## saaH2O NA NA
## faal.fold NA NA
## polar.req NA NA
## freq 0.14 NA
The following code tells us the variables that have the highest correlation. saaH2O and Volume have the highest with r = 0.99, and from the correlation matrix we can see that frequency and polarity have the lowest correlation. with r = 0.01.
which(cor_== max(cor_, na.rm = T), arr.ind = T)
## row col
## saaH2O 8 2
Next we will create a scatterplot matrix, which allows us to graph and observe the relationships between multiple variables in one image. The code below does this for the variables molecular weight, volume, bulk, polarity, H2O Hydrophobicity, and polar requirement. From the plot, we can see that MW.da and vol as well as bulk and vol have highly positively correlated relationships. MW.da and bulk have slight positive correlation. Polarity has non-linear relationships with all the variables. All other correlations are low.
splom(~aa_dat[c(2:5, 7, 11)], data = aa_dat, superpanel = lattice.getOption("panel.pairs"),
key = list(title = "Amino Acid Data", columns= 6, text=list(c("Molecular Weight","Volume", "Bulk", "Polarity", "H2O Hyrophobicity", "Polar Requirement"))))
###Plot 1: Replication of Higgs and Attwood
This plot represents a visualization of volume vs isoelectric points of the amino acids. The amino acids are labelled by their one-letter abbreviation. There does not appear to be a high correlation between the two variables.
plot(vol ~ isoelec, data = aa_dat, #creates main plot
xlim = c(2,12),
ylim = c(40,180),
xlab = "Isoelectric Point (pI)", #adds y axis label
ylab = "Volume", #adds y-axis label
main = "Volume vs. Isoelectric Point", #adds title
col = 0)
text(vol ~ isoelec, #adds text labels on points
labels = aa,
data = aa_dat,
col = 1:20)
###Figure 2: HA Figure 2.8 with ggpubr
Next, plot volume vs isoelectric point using ggpubr
ggscatter(y = "vol",
x = "isoelec",
size = "bulk",
color = "freq",
data = aa_dat,
xlab = "Isoelectric Point",
ylab = "Volume",
title = "Volume vs Isoelectric Point")
###Figure 3: Highly Correlated Data The variables MW.da and volume are highly correlation. They are plotted below, with a regression line on the graph using the equation y=mx+b, where m is the slope and b is the y-intercept. The code also adds an ellipse, which adds a confidence interval to the graph based on probability.
ggscatter(data = aa_dat,
y = "MW.da",
x = "vol",
add = "reg.line", # adding line of best fit
ellipse = TRUE, # adding ellipse
cor.coef = TRUE, # adding correlation coefficient
xlab = "Volume",
ylab = "Molecular Weight (daltons)")
## `geom_smooth()` using formula 'y ~ x'
The variables faal.fold and polar.req have a non-linear relationship. They are plotted below with a loess line showing the relationship. The color corresponds to frequency and the size corresponds to bulk.
ggscatter(data = aa_dat,
y = "faal.fold",
x = "polar.req",
add = "loess",
xlab = "Polar Requirement",
ylab = "Fraction of Accessible Area Lost when Protein in Folded",
col = "freq",
size = "bulk")
## `geom_smooth()` using formula 'y ~ x'
###Figure 5: Non-Linear Relationship in Log Scale
the previous graph is recoded below but with log computations of the variables. This transforms a non-linear relationship into a linear relationship.
ggscatter(data= aa_dat,
y = "faal.fold",
x = "polar.req",
yscale = "log10",
xscale = "log10",
add = "reg.line",
xlab = "log10(Polar Requirement)",
ylab = "log10(Fraction of Accessible Area Lost when Protein in Folded)",
col = "freq",
size = "bulk")
## `geom_smooth()` using formula 'y ~ x'
###Figure 6: 3D Scatterplot
Next we will make a 3d scatterplot of the variables polar.req, faal.fold, and h20pho.34. Polar.req and faal.fold have a strong negative relationship, polar req and h20pho.34 have a strong negative relationship, and faal.fold and h20pho.34 have a strong positive relationship.
scatterplot3d(x = aa_dat$polar.req,
y = aa_dat$faal.fold,
z = aa_dat$H2Opho.34,
xlab = "Polar Requirement",
ylab = "Fraction of Accessible Area Lost when Protein in Folded",
zlab = "Hydrophobicity scale",
highlight.3d = TRUE,
type = "h")
###Principal Component Analysis Principal component analysis (PCA) is a visual representation of similarities and differences between objects. The further away the objects on the map, the more different or unrelated they are. When Attwood and Higgs performed PCA, they describe the process as “a way of combining the information from all eight [amino acid] properties into a two-dimensional graph” (Higgs and Attwood 2005, pg 26). We will perform PCA using base R and vegan programs.
First, we must eliminate variables that were not used by Attwood and Higgs, including categorical variables, MW.da, and frequency. We can do this by creating a new dataframe called aa_dat2 that only has the columns we want to keep for PCA.
##Data Prep
### Build dataframe
#no categorical variables, MW.da, or frequency
aa_dat2 <- data.frame(vol,
bulk, pol,
isoelec, H2Opho.34, H2Opho.35,
saaH2O, faal.fold, polar.req)
###Figure 7: Principal Components Analysis- base R
Next, we run PCA using base R.
pca1 <- prcomp(aa_dat2, scale. = TRUE)
biplot(pca1)
###Figure 8: Principal Components Analysis - Vegan
Next, we run PCA using vegan. When comparing this to attwood and higgs (2005) figure 2.10, they do not show the groupings of amino acids with similarities.
rda.out <- vegan::rda(aa_dat2,scale = TRUE) #scale=TRUE makes the plot larger and easier to read
rda_scores <-scores(rda.out)
biplot(rda.out, display = "sites", col = 0,)
orditorp(rda.out, display = "sites", labels = aa_dat$aa, cex = 1.2)
ordihull(rda.out,
group = aa_dat$charge,
col = 1:7,
lty = 1:7,
lwd =c(3,6))
###Cluster Analysis Cluster analysis shows groupings of variables and allows for visualization of similarities. Variables in the same cluster are more alike than variables in different clusters. Below, we use UPGMA, which is unweighted cluster analysis. This means that all distances contribute equally to the average distance that is obtained. This contrasts the more accurate method: WPGMA.
##Unweighted Cluster Analysis
Perform UPGMA analysis on aa_dat2. The euclidean distance is an arbitrary reference number assigned to a line or vector to compare it to other lines or vectors. we convert the output into a dendrogram to visualize, which looks very similar to Attwood and Higgs (2005) Plate 2.2 .
row.names(aa_dat2) = paste(aa_dat$aa, sep = ".")
dist_euc <- dist(aa_dat2,
method = "euclidean") #sets it to use UPGMA
cluster_analysis_aa <- hclust(dist_euc,
method = "average")
dendrogram_aa <- as.dendrogram(cluster_analysis_aa)
plot(dendrogram_aa, horiz = T)