Replicating Higgs and Attwood’s analysis on the properties of amino acids

##Introduction

Studying and analyzing Higgs and Attwood’s amino acid analysis proved very useful in understanding how useful R is for manipulating, visualizing and understanding data. It also clearly displays how easy it is to find trends and other interesting factors hidden in data that isn’t put into a visual display format. In this document a few of R’s plotting mechanisms will be used, centering around ggpubr. Key functions that are used is the scatterplot3D(), plot(), and, ggscatter(). All of these are very customizable and can utilize an array of different visual aids to help understand Higgs and Attwood’s data even clearer.

##Preliminaries

The package permute: contains functions used to generate permutations of data The package lattice: contains graphics needed to make certain plots and charts

Two important and highly functional packages are ggpubR and ggplot2. These have an extensive array of functions and abilities to make data analysis easier. ggpubr makes using the functions of ggplot2 much more user friendly.

#installing packages

#install.packages("permute")
#install.packages("lattice")
#in#stall.packages("vegan")
#install.packages("ggpubr")
#install.packages("colorspace")

#loading packages

```{r. echo = F} library(ggplot2) # plotting functions library(ggpubr) # wrapper for GGplot2 library(vegan) # functions for plotting PCA library(scatterplot3d) # 3D scatterplot


##Data

These data are derived mostly from Higgs (2009) Table 1. They are similar to Higgs and Attwood 2005 but
have extra columns. Nathan Brouwer added additional categorical varibles based on online information.


```r
aa <-c('A','C','D','E','F','G','H','I','K','L','M','N',
'P','Q','R','S','T','V','W','Y')

## molecular weight in dalton
MW.da <-c(89,121,133,146,165,75,155,131,146,131,149,132,115,147,174,105,119,117,204,181)
##pI values
pI <- c(6.00, 5.05, 2.77, 3.22,5.48,5.97,7.59, 6.02, 9.74, 5.98, 5.74, 5.41, 6.30,5.65, 10.76, 5.68, 5.66, 5.96, 5.89, 5.66)
## vol from van der Waals radii
vol <-c(67,86,91,109,135,48,118,124,135,124,124,96,90,
114,148,73,93,105,163,141)

## bulk – a measure of the shape of the side chain
bulk <-c(11.5,13.46,11.68,13.57,19.8,3.4,13.69,21.4,
15.71,21.4,16.25,12.28,17.43,
14.45,14.28,9.47,15.77,21.57,21.67,18.03)

## pol – a measure of the electric field strength around the molecule
pol <-c(0,1.48,49.7,49.9,0.35,0,51.6,0.13,49.5,0.13,
1.43,3.38,1.58,3.53,52,1.67,1.66,0.13,2.1,1.61)

## isoelec point
isoelec <-c(6,5.07,2.77,3.22,5.48,5.97,7.59,6.02,9.74,5.98,
5.74,5.41,6.3,5.65,10.76,5.68,6.16,5.96,5.89,5.66)

## 1st Hydrophobicity scale
H2Opho.34 <-c(1.8,2.5,-3.5,-3.5,2.8,-0.4,-3.2,4.5,-3.9,3.8,1.9,
-3.5,-1.6,-3.5,-4.5,-0.8,-0.7,4.2,-0.9,-1.3)
## 2nd Hydrophobicity scale
H2Opho.35 <-c(1.6,2,-9.2,-8.2,3.7,1,-3,3.1,-8.8,2.8,3.4,-4.8,
-0.2,-4.1,-12.3,0.6,1.2,2.6,1.9,-0.7)

## Surface area accessible to water in an unfolded peptide
saaH2O <-c(113,140,151,183,218,85,194,182,211,180,204,158,
143,189,241,122,146,160,259,229)

## Fraction of accessible area lost when a protein folds
faal.fold <-c(0.74,0.91,0.62,0.62,0.88,0.72,0.78,0.88,0.52,
0.85,0.85,0.63,0.64,0.62,0.64,0.66,0.7,0.86,0.85,0.76)

# Polar requirement
polar.req <-c(7,4.8,13,12.5,5,7.9,8.4,4.9,10.1,4.9,5.3,10,
6.6,8.6,9.1,7.5,6.6,5.6,5.2,5.4)

## relative frequency of occurance
## "The frequencies column shows the mean
### percentage of each amino acid in the protein sequences
### of modern organisms"
freq <-c(7.8,1.1,5.19,6.72,4.39,6.77,2.03,6.95,6.32,
10.15,2.28,4.37,4.26,3.45,5.23,6.46,5.12,7.01,1.09,3.3)
##
# charges
## un = Un-charged
## neg = negative
## pos = positive
charge<-c('un','un','neg','neg','un','un','pos','un','pos',
'un','un','un','un','un','pos','un','un','un','un','un')

pol.log <- log(pol)
MW.da.log <- log(MW.da)
hydropathy<-c('hydrophobic','hydrophobic','hydrophilic','
hydrophilic','hydrophobic','neutral','neutral',
'hydrophobic','hydrophilic','hydrophobic','hydrophobic',
'hydrophilic','neutral',
'hydrophilic','hydrophilic','neutral','neutral',
'hydrophobic','hydrophobic','neutral')

vol.cat<-c('verysmall','small','small','medium',
'verylarge','verysmall','medium','large','large',
'large','large','small','small','medium','large',
'verysmall',
'small','medium','verylarge','verylarge')

pol.cat<-c('nonpolar','nonpolar','polar','polar',
'nonpolar','nonpolar','polar','nonpolar',
'polar','nonpolar','nonpolar','polar','nonpolar','polar',
'polar','polar','polar','nonpolar','nonpolar','polar')

chemical<-c('aliphatic','sulfur','acidic','acidic','aromatic',
'aliphatic','basic','aliphatic','basic','aliphatic','sulfur',
'amide','aliphatic','amide','basic','hydroxyl','hydroxyl',
'aliphatic','aromatic','aromatic')

##Build the Dataframe

aa_dat <- data.frame(aa, MW.da,MW.da.log, pI, vol, bulk, pol,pol.log, isoelec, H2Opho.34, H2Opho.35, saaH2O, faal.fold, polar.req, freq)

##Raw data exploration

The code below is creating a correlation matrix using the provided data of variables describing the amino acids.

cor_ <- round(cor(aa_dat[,-c(1,13:17)]),2)
diag(cor_) <- NA
cor_[upper.tri(cor_)] <- NA
cor_

##           MW.da MW.da.log    pI   vol  bulk   pol pol.log isoelec H2Opho.34
## MW.da        NA        NA    NA    NA    NA    NA      NA      NA        NA
## MW.da.log  0.99        NA    NA    NA    NA    NA      NA      NA        NA
## pI         0.21      0.18    NA    NA    NA    NA      NA      NA        NA
## vol        0.93      0.94  0.37    NA    NA    NA      NA      NA        NA
## bulk       0.55      0.60  0.08  0.73    NA    NA      NA      NA        NA
## pol        0.29      0.32  0.27  0.24 -0.20    NA      NA      NA        NA
## pol.log     NaN       NaN   NaN   NaN   NaN   NaN      NA      NA        NA
## isoelec    0.20      0.18  1.00  0.36  0.08  0.27     NaN      NA        NA
## H2Opho.34 -0.27     -0.26 -0.20 -0.08  0.44 -0.67     NaN   -0.20        NA
## H2Opho.35 -0.25     -0.27 -0.27 -0.16  0.32 -0.85     NaN   -0.26      0.85
## saaH2O     0.97      0.96  0.36  0.99  0.64  0.29     NaN    0.35     -0.18
##           H2Opho.35 saaH2O
## MW.da            NA     NA
## MW.da.log        NA     NA
## pI               NA     NA
## vol              NA     NA
## bulk             NA     NA
## pol              NA     NA
## pol.log          NA     NA
## isoelec          NA     NA
## H2Opho.34        NA     NA
## H2Opho.35        NA     NA
## saaH2O        -0.23     NA

The code below is returning the row and column point where the maximum correlation is viewed. For our data is will return row 8 and column 9, which is a correlation of 0.99. Most positive correlation: saaH20 and vol Most negative correlation: polar.req and H20pho.35

which(cor_ == max(cor_, na.rm = T), arr.ind = T)

##         row col
## isoelec   8   3

##Plot 0: Scatterplot matrix Scatterplot matrix including the following variables: 1. MW.da 2. vol 3. pol 4. H2Opho.34 5. polar.req 6. freq 7. bulk

plot(aa_dat[,c("MW.da", "vol","pol","H2Opho.34","polar.req", "freq", "bulk")],
     panel = panel.smooth, main = "Plot 0: Scatterplot Matrix")

The code chunk below is important code, despite looking relatively clunky. It is an important function when calculating the correlation coefficient matrix for pair or data. It takes in arguments of x,y which are the data sets, as well as number of digits to plot, a prefix for the coefficients and cex.cor, which is the character expansion factor.

panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor,...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}

After observing the results of the correlations of all of the 7 variables compared in the matrix, it is clear which variables are correlated. The variables MW.da, vol and bulk all have posite correlations with each other. H2Opho.34 and polar.req have a clear negative correlation. Some data have irregular correlations with no real trend being viewed. This occurs with H2Opho.34 and vol, as well as with MW.da. A few plots show interestingly distributed data, with points being on either far end of the plot. This is observed in all of the plots involving polarity.

##Plot 1: Replication Higgs and Attwood Figure 2.8

PLot 1 is a simplistic recreation og Higgs and Attwood’s figure 2.8. It is effective at displaying the very positive correlation between volume and pI of amino acids. Clearly amino acids iwth larger volumes will in turn have a higher pI.

ggpubr::ggscatter(y = "vol", x = "pI", data = aa_dat,font.label = c(10, "plain"), point = TRUE, size = 1, label = TRUE, 
          xlab = "pI", ylab = "Volume", title = "Plot 1: Volume vs. pI")

#Figure 2: HA Figure 2.8 with ggpubr

Figure 2 displays a positively correlated scatterplot comparing the variables molecular weight and volume.

ggpubr::ggscatter(y = "MW.da",
x = "vol",
size = "polar.req",
color = "polar.req",
shape = "square",
data = aa_dat,
xlab = "Volume",
ylab = "Molecular Weight",
title = "Figure 2")

##Figure 3: Highly correlated data

Identify 2 variables that are highly positively correlated and plot them in GGPUBR: vol and MW.da
added a regression line
added a data ellipse
added a correlation coefficient

#Regression The command add = “line.reg” adds a regression line to the plot. This line is in the form y = mx +b, or the commonly used statistical format y = B0 + B1*x. #Elipse In addition to the regression line, a data ellipse was added and it helps predict where all of the data will lie according to the already plotted data points.

ggpubr::ggscatter(y = "vol",
x = "MW.da",
size = "vol",
add = "reg.line", # line of best fit
ellipse = TRUE, # data ellipse
cor.coef = TRUE, # correlation coef
# color = "freq",
data = aa_dat,
xlab = "",
ylab = "", title = "Figure 3")

## `geom_smooth()` using formula 'y ~ x'

##Figure 4: Non-linear relationship

Figure 4 displays the scatter plot created with ggpubr comparing two non-linearly related variables. A loess smoother was added. Size and color varies according to the polarity variable.

ggpubr::ggscatter(data = aa_dat, x = "MW.da", y = "pol",size = "pol", color = "pol", fill = "lightgray", add = "loess", xlab = "Molecular Weight", ylab = "Polarity", title = "Figure 4")

## `geom_smooth()` using formula 'y ~ x'

##Figure 5: Non-linear relationship on LOG scale

Here is the abnormal data from figure 4 plotted logarithmically. This makes it easier to understand the data in a linear sense.

pol.log <- log(pol) #calcs log of pol vector 
MW.da.log <- log(MW.da) #calcs log of the MW.da vector 

ggpubr::ggscatter(data = aa_dat, x = "MW.da.log", y = "pol.log",size = "pol.log", color = "pol", add = "reg.line", xlab = "LN of Molecular Weight", ylab = "LN of Polarity", title = "Figure 4")

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 2 rows containing non-finite values (stat_smooth).

## Warning in sqrt(x): NaNs produced

## Warning: Removed 2 rows containing missing values (geom_point).

##Figure 6: 3D Scatterplot

The three variables I chose for the 3D scatterplot are MW.da, vol and bulk. These three data are the most positively correlated data of all the variables in the dataframe. The 3D scatterplot shows the overall upwards trend of these three variables.

scatterplot3d::scatterplot3d(aa_dat[, c("MW.da","vol","bulk")], highlight.3d = TRUE, type = "h")

#Principal component analysis

PCA is a helpful tool to use information from a multitude of variables in one graph.Higgs and Attwood used this method when they wanted to show the relation of acidity, hydrophobicity and pI. They stated that PCA could be used to incorperate all 8 variables: “What we need is a way of combining the information from all eight properties into a two-dimensional graph. This can be done with principal component analysis (PCA)”(Higgs and Attwood 2005, pg 26). 2 basic ways we made PCA plots in this class are using the basic R commands and also by using the vegan package.

#Data prep

Creating a dataframe consisting of only the numerical data from Higgs and Atwood’s data:

aa_dat2 <- data.frame(vol, bulk, pol, pI, H2Opho.34, H2Opho.35, saaH2O, faal.fold)

This new dataframe created above simply takes the prior data and create a new dataframe with only numeric data, and only the variables considered in table 2.2 from Higgs and Atwood’s paper on page 24.

##Figure 7: Principal components analysis - base R

Run a PCA and create a PCA plot using base R command.

aa_dat2.pca <- princomp(aa_dat2)
biplot(aa_dat2.pca)

##Figure 8: Principal components analysis - vegan

rda.out <- vegan::rda(aa_dat2, scale =TRUE) ## <--- centers the data and makes the
biplot(rda.out)                             ##      scale more congruent and sensible

rda_scores <- vegan::scores(rda.out)
rda_scores

## $species
##                  PC1         PC2
## vol        0.1356099 -1.21785932
## bulk      -0.5196719 -1.00142163
## pol        1.0407658 -0.21665973
## pI         0.4351231 -0.52606872
## H2Opho.34 -1.1566223 -0.06803928
## H2Opho.35 -1.1916485  0.06980511
## saaH2O     0.2356071 -1.17813587
## faal.fold -1.0639131 -0.35830409
## 
## $sites
##                PC1        PC2
## sit1  -0.434294425  1.0063944
## sit2  -0.813549384  0.4986821
## sit3   0.954680934  0.7207667
## sit4   0.939818780  0.2301841
## sit5  -0.862556271 -0.7658451
## sit6  -0.099385037  1.7887438
## sit7   0.646214109 -0.3324241
## sit8  -1.009622881 -0.5674630
## sit9   1.455246151 -0.6722893
## sit10 -0.900560477 -0.5283670
## sit11 -0.665476480 -0.3930415
## sit12  0.460467547  0.5319188
## sit13  0.006879387  0.3663943
## sit14  0.459178376  0.0542047
## sit15  1.587076895 -1.0794305
## sit16 -0.018152123  1.0803175
## sit17 -0.207658131  0.4021859
## sit18 -0.976206146 -0.2399968
## sit19 -0.414016908 -1.3759388
## sit20 -0.108083916 -0.7249964
## 
## attr(,"const")
## [1] 3.511243

#Another biplot

Below is another biplot. As compared to figure 2.10 from Higgs and Atwoods paper, there are a few differences. For one, the data on this one is more centered as I used a different grouping variable. This plot doesnt consist of point, just the labels. Their plot also has gridlines, a gray fill, and more precise margins.

rda.out <- vegan::rda(aa_dat2,
scale = TRUE)

rda_scores <- vegan::scores(rda.out)

biplot(rda.out, display = "sites", col = 0,)
vegan::orditorp(rda.out, display = "sites", labels = aa_dat$aa, cex = 1.2)

vegan::ordihull(rda.out,
group = aa_dat$charge,
col = 1:7,
lty = 1:7,
lwd = c(3,6))

##Cluster analysis

CLuster analysis is a helpful tool to use information from a multitude of variables in one graph. It uses a compound of data to display similarities and differences of multiple variables at once. UPGMA is a hierarchical clustering method. It is very useful when generating phylogenetic trees as it is good at finding similarities between data and grouping them together. UPGMA is a relatively simple way of doing this, but it is also relatively easy to accomplish in R as well.

#Hierarchical cluster analysis

NOTE: I use the aa_dat2 data we used above for all of this.

dist_euc <- dist(aa_dat2, method = "euclidean")
row.names(aa_dat2) <- paste(aa_dat2$MW.da,1:nrow(aa_dat2),sep = ".") #not sure if this is needed or not
plot(dist_euc)

A distance matrix is used to display relative differences between pairwise values. Euclidean distance is the usual distance between the two vectors (2 norm aka L_2), sqrt(sum((x_i - y_i)^2)). (this information came from the R package stats version 4.0.2)

#Dendogram

clust_euc <- hclust(dist_euc, method = "average") #the "method  = "complete"" sets this to UPGMA
plot(clust_euc)

This cluster diagram differs from Higgs and Atwoods plate 2.2 in a few ways. Fer one, this one has a height scale and is labeled with numbers rather than letters. The formatting is also different as theirs is vertical and more streamlined.

I also didn’t include a heatmap.

Replicating Higgs and Attwood’s analysis on the properties of amino acids

Theresa Hutchins with contributions from Nathan Brouwer

11/12/2020