##Introduction This script is a replication of Higgs and Attwood’s analysis on the properties of amino aicds. This project will help us determine whether there is a correlation between amino acids and their relevance in protein folding, The amino acid property data used in this script is used with principal component analysis and clustering algorithms. Amino acids vary greatly in size, charge, hydrophobicity, and other physical properties. Principal component analysis (PCA) is a way of visualizing the important features of multidimensional data sets, such as tables of amino acid properties.
##Preliminaries Packages
library(ggplot2) # create graphs that represent both univariate and multivariate numerical and categorical data in a straightforward manner
library(ggpubr) # create easily publication-ready plots, makes ggplot2 easier
library(vegan) # PCA has functions doing/plotting PCA
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.5-6
library(scatterplot3d) # creates scatterplot in R
library(permute) # implementation of the permutation schemes
library(lattice) # number of different functions to create different types of plot
The ggplot2 package allows R to create many different types of graphical represenations of data, and the ggpubr package makes these graphs easily ready for publication. They go hand-in-hand.
##Build the dataframe These data are derived mostly from Higgs (2009) Table 1. They are similar to Higgs and Attwood 2005 but have extra columns. Nathan Brouwer added additional categorical varibles based on online information. Make the vectors.
# 1 letter code
aa <-c('A','C','D','E','F','G','H','I','K','L','M','N', 'P','Q','R','S','T','V','W','Y')
## molecular weight in dalton
MW.da <-c(89,121,133,146,165,75,155,131,146,131,149,132,115,147,174,105,119,117,204,181)
## vol from van der Waals radii
vol <-c(67,86,91,109,135,48,118,124,135,124,124,96,90, 114,148,73,93,105,163,141)
## bulk – a measure of the shape of the side chain
bulk <-c(11.5,13.46,11.68,13.57,19.8,3.4,13.69,21.4, 15.71,21.4,16.25,12.28,17.43,
14.45,14.28,9.47,15.77,21.57,21.67,18.03)
## pol – a measure of the electric field strength around the molecule
pol <-c(0,1.48,49.7,49.9,0.35,0,51.6,0.13,49.5,0.13, 1.43,3.38,1.58,3.53,52,1.67,1.66,0.13,2.1,1.61)
## isoelec point
isoelec <-c(6,5.07,2.77,3.22,5.48,5.97,7.59,6.02,9.74,5.98, 5.74,5.41,6.3,5.65,10.76,5.68,6.16,5.96,5.89,5.66)
## 1st Hydrophobicity scale
H2Opho.34 <-c(1.8,2.5,-3.5,-3.5,2.8,-0.4,-3.2,4.5,-3.9,3.8,1.9, -3.5,-1.6,-3.5,-4.5,-0.8,-0.7,4.2,-0.9,-1.3)
## 2nd Hydrophobicity scale
H2Opho.35 <-c(1.6,2,-9.2,-8.2,3.7,1,-3,3.1,-8.8,2.8,3.4,-4.8, -0.2,-4.1,-12.3,0.6,1.2,2.6,1.9,-0.7)
## Surface area accessible to water in an unfolded peptide
saaH2O <-c(113,140,151,183,218,85,194,182,211,180,204,158, 143,189,241,122,146,160,259,229)
## Fraction of accessible area lost when a protein folds
faal.fold <-c(0.74,0.91,0.62,0.62,0.88,0.72,0.78,0.88,0.52, 0.85,0.85,0.63,0.64,0.62,0.64,0.66,0.7,0.86,0.85,0.76)
# Polar requirement
polar.req <-c(7,4.8,13,12.5,5,7.9,8.4,4.9,10.1,4.9,5.3,10, 6.6,8.6,9.1,7.5,6.6,5.6,5.2,5.4)
## relative frequency of occurance
## "The frequencies column shows the mean
### percentage of each amino acid in the protein sequences ### of modern organisms"
freq <-c(7.8,1.1,5.19,6.72,4.39,6.77,2.03,6.95,6.32,
10.15,2.28,4.37,4.26,3.45,5.23,6.46,5.12,7.01,1.09,3.3)
##
# charges
## un = Un-charged
## neg = negative
## pos = positive charge<-c('un','un','neg','neg','un','un','pos','un','pos','un','un','un','un','un','pos','un','un','un','un','un')
## hydropathy
hydropathy<-c('hydrophobic','hydrophobic','hydrophilic',' hydrophilic','hydrophobic','neutral','neutral',
'hydrophobic','hydrophilic','hydrophobic','hydrophobic',
'hydrophilic','neutral',
'hydrophilic','hydrophilic','neutral','neutral',
'hydrophobic','hydrophobic','neutral')
## vol
vol.cat<-c('verysmall','small','small','medium', 'verylarge','verysmall','medium','large','large',
'large','large','small','small','medium','large',
'verysmall',
'small','medium','verylarge','verylarge')
## pol
pol.cat<-c('nonpolar','nonpolar','polar','polar', 'nonpolar','nonpolar','polar','nonpolar',
'polar','nonpolar','nonpolar','polar','nonpolar','polar',
'polar','polar','polar','nonpolar','nonpolar','polar')
## chemical
chemical<-c('aliphatic','sulfur','acidic','acidic','aromatic', 'aliphatic','basic','aliphatic','basic','aliphatic','sulfur', 'amide','aliphatic','amide','basic','hydroxyl','hydroxyl', 'aliphatic','aromatic','aromatic')
Build the data frames. ### Build dataframe
aa_dat = data.frame(aa)
aa_dat <- data.frame(aa,
MW.da,vol,
bulk, pol,
isoelec, H2Opho.34, H2Opho.35,
saaH2O, faal.fold, polar.req,
freq, hydropathy,
vol.cat, pol.cat, chemical)
##creating a correlation table of all the variables in the data frame
cor_ <- round(cor(aa_dat[,-c(1,13:17)]),2)
diag(cor_) <- NA
cor_[upper.tri(cor_)] <- NA
cor_
## MW.da vol bulk pol isoelec H2Opho.34 H2Opho.35 saaH2O faal.fold
## MW.da NA NA NA NA NA NA NA NA NA
## vol 0.93 NA NA NA NA NA NA NA NA
## bulk 0.55 0.73 NA NA NA NA NA NA NA
## pol 0.29 0.24 -0.20 NA NA NA NA NA NA
## isoelec 0.20 0.36 0.08 0.27 NA NA NA NA NA
## H2Opho.34 -0.27 -0.08 0.44 -0.67 -0.20 NA NA NA NA
## H2Opho.35 -0.25 -0.16 0.32 -0.85 -0.26 0.85 NA NA NA
## saaH2O 0.97 0.99 0.64 0.29 0.35 -0.18 -0.23 NA NA
## faal.fold 0.11 0.18 0.49 -0.53 -0.18 0.84 0.79 0.12 NA
## polar.req -0.05 -0.19 -0.53 0.76 -0.11 -0.79 -0.87 -0.11 -0.81
## freq -0.52 -0.30 -0.04 -0.01 0.02 0.26 -0.02 -0.38 -0.18
## polar.req freq
## MW.da NA NA
## vol NA NA
## bulk NA NA
## pol NA NA
## isoelec NA NA
## H2Opho.34 NA NA
## H2Opho.35 NA NA
## saaH2O NA NA
## faal.fold NA NA
## polar.req NA NA
## freq 0.14 NA
The code below applies the which() function, which in this scenario idenitfies where in the table has the most negative correlation between amino acids
saaH2O and vol have the most positive correlation
polar.req and H2Opho.35 have the most negative correlation
# identifying which where in the table has the most negative correlation
which(cor_ == max(cor_, na.rm = T), arr.ind = T)
## row col
## saaH2O 8 2
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}
## makes subset of scatter plot matrix
plot(aa_dat[,c("MW.da", "vol", "pol",
"H2Opho.34", "polar.req",
"saaH2O", "faal.fold")],
panel = panel.smooth,
upper.panel = panel.cor,
main = "Example")
## creates the dimensions of all the plots in the presentation and includes correlations
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor,...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}
According to the 6 variables I plotted, there is a large range of weak to strong correlations. The correlation between MW.da and polar.req is the weakest. The strongest correlation is between the variables vol and saaH2O.
This plot represent the relationship the polar.req (polar requirement) vs. freq(mean percentage of each amino acid in the protein sequences of modern organisms")
plot(polar.req - freq, data = aa_dat, #does main plotting
xlab = "freq", #labels x-axis
ylab = "polar req.", #labels y-axis
main = "polar req v. freq.",
col = 0)
## Warning in plot.window(...): "data" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "data" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "data" is not a
## graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "data" is not a
## graphical parameter
## Warning in box(...): "data" is not a graphical parameter
## Warning in title(...): "data" is not a graphical parameter
text(polar.req - freq,
labels = aa,
data = aa_dat,
col = 1:20)
## Warning in text.default(polar.req - freq, labels = aa, data = aa_dat, col =
## 1:20): "data" is not a graphical parameter
## Figure 2: HA Figure 2.8 with ggpubr
ggscatter(y = "polar.req", x = "freq",
size = "polar.req",
color = "polar.req",
data = aa_dat,
xlab = "freq",
ylab = "polar.req")
ggscatter(y = "vol",
x = "pol",
size = "vol",
color = "vol",
add = "loess",
data = aa_dat)
## `geom_smooth()` using formula 'y ~ x'
aa_dat$vol_log <- log(aa_dat$vol)
ggscatter(y = "vol",
x = "pol",
add = "reg.line",
cor.coef = TRUE,
size = "vol",
color = "vol",
data = aa_dat)
## `geom_smooth()` using formula 'y ~ x'
##Figure 6: 3D Scatterplot I chose to make a 3d scatterplot of the variavles bulk, pol, and freq. The data plots are rather close to each other and not incredibly scattered, which suggest that these 3 variables would have a strong correlation.
aa_dat$bulk_log <- log(aa_dat$bulk)
aa_dat$pol_log <- log(aa_dat$pol)
aa_dat$freq_log <- log(aa_dat$freq)
aa_dat_df <- data.frame(aa_dat)
scatterplot3d(aa_dat_df[, c("bulk_log",
"pol_log",
"freq_log")]) #highlight.3d = TRUE type="h" #<---- can't get these two to be working
According to Higgs and Attwood, PCA “transforms this cloud of points first by scaling them and shifting them to the origin, and then by rotating them in such a way that the points are spread out as much as possible, and the structure in the data is made easier to see.” (pg.26). One can do this by beginning with the data in the form of an N x P matrix, where each row in the data matrix can be thought of as the coordinates of a point in P-dimensional space. The PCA diagram manages to do a fairly good job at illustrating all these similarities at the same time.
To make this new dataframe, first assign the data to a new variable, which is then subsetted as a dataframe using ngative indexing, followed by column naming.
subset dataframe using negative indexing
aa_dat2 <- data.frame(aa_dat)
aa_dat2 <- aa_dat[,-c(1,
2,11,12,
13,14,15,16,17,
18)]
subset by naming the colums
aa_dat2 <- aa_dat[,c("vol" = 1,"bulk" = 2,"pol" = 11,"isoelec" = 12,
"h2Opho.34" = 13,"h2Opho.35" = 14,
"saaH2O" = 15, "faal.fold" = 16, "polar.req" = 17)]
##couldnt get code to work from here on ## Figure 7: Principal components analysis - base R Run a PCA and create a PCA plot using base R command.
#aa_dat2 <- na.omit(aa_dat2)
#pca.out <- prcomp(as.numeric(aa_dat2[1,2,11,12,13,14,15,16,17,18], scale = TRUE) )
Plot the output
##biplot(pca.out)
scale = TRUE means that scaling is done by dividing the (centered) columns of x by their standard deviations if center is TRUE, and the root mean square otherwise.
## rda.out <- vegan::rda(aa_dat2,
## scale = TRUE)
Extract what are known as the PCA score
##rda_scores <- scores(rda.out)
Make a “biplot” (I usually call this a PCA plot) using the code below.
There are many differences between my plot and the one in Figure 2.10, page 28, of Higgs and Attwood. The scale for the a and y axes are different, with this plot having larger numbers. The lay out of the data is also different, with Higg’s data being more scattered throguhout the graph whereas this one’s data values are more clumped together in the center of the graph.
## biplot(rda.out, display = "sites", col = 0,)
## orditorp(rda.out, display = "sites", labels = aa_dat$aa, cex = 1.2)
## ordihull(rda.out,
## group = aa_dat$hydropathy,
## col = 1:7,
## lty = 1:7,
## lwd = c(3,6))
The purpose of cluster analysis is to provide a form of data analysis in which observations are divided into different groups that share common characteristics. It performs the task of assigning some set of objects into the groups are also known as clusters.
UPGMA constructs a rooted tree that reflects the structure present in a pairwise similarity matrix, or a dissimilarity matrix. At each step, the nearest two clusters are combined into a higher-level cluster.
#Hierarchical Cluster Analysis The Euclidean distance of a line segment between the two points. It can be calculated from the Cartesian coordinates of the points using the Pythagorean theorem, and is occasionally called the Pythagorean distance.
##dist_euc <- dist(aa_dat2,
## method = "euclidean")
##clust_euc <- hclust(dist_euc,
## method = "average") #this part sets it to use UPGMA
My cluster diagram is different in many ways compared to the cluster diagram produced by Higgs and Atwood (2005) in Plate 2.2. The layout of the diagram is more flat and horizontal, while the one in the reading is more compressed and skinny. There are some similarities, like how there are 4 distinct, broad, large classes with multiple min clades within them.The color scheme is also different as the colors were more scattered on mine compared to the one in reading in which the colors were more distinct from each other.
## clust_euc <- hclust(dist_euc)
## dendro_euc <- as.dendrogram(clust_euc)
## plot(dendro_euc.horiz = T,
## nodePar = list(pch = c(1,NA),
## cex = 0.5,
## lab.cex = 1.2))