Introduction

This script forms a analysis of the 20 Amino Acids and their various data sets and properties. It takes those properties and categorizes the 20 Amino Acids based on similarities and differences, because similar amino acids tend to have the same functions so we can cluster them together. This script uses PCA, UPGMA, Scatterplot matrixes and other R functions.It is a relication of the work done by Higgs and Atwood “Amino Acid Chemistry” paper.

Preliminaries

Packages

ggplot2 creates elegant and better visualized plots than the default R plots, ggpubr allows us to add additional modifications and formatting to our ggplot2 plots.

library(ggpubr) #is a package that creates "wrappers" for much of ggplot2's core functionality

## Warning: package 'ggpubr' was built under R version 4.0.3

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 4.0.3

library(ggplot2) #Creates elegant data visualizations and plots using the grammar of graphics
library(permute) #Functions for generating restricted permuations of data

## Warning: package 'permute' was built under R version 4.0.3

library(lattice) #a powerful and elegant high-level data visualization system
library(vegan) #package used for multivariable analysis

## Warning: package 'vegan' was built under R version 4.0.3

## This is vegan 2.5-6

library(scatterplot3d) #used to make 3D plots

## Warning: package 'scatterplot3d' was built under R version 4.0.3

Build the dataframe

this data represents all the amino acids found in nature. it details each amino acid and their properties, such as bulk, polarity, molecular weight etc.

# 1 letter code
aa <-c('A','C','D','E','F','G','H','I','K','L','M','N',
'P','Q','R','S','T','V','W','Y')

## molecular weight in dalton
MW.da <-c(89,121,133,146,165,75,155,131,146,131,149,132,115,147,174,105,119,117,204,181)

## vol from van der Waals radii
vol <-c(67,86,91,109,135,48,118,124,135,124,124,96,90,
114,148,73,93,105,163,141)

## bulk – a measure of the shape of the side chain
bulk <-c(11.5,13.46,11.68,13.57,19.8,3.4,13.69,21.4,
15.71,21.4,16.25,12.28,17.43,
14.45,14.28,9.47,15.77,21.57,21.67,18.03)

## pol – a measure of the electric field strength around the molecule
pol <-c(0,1.48,49.7,49.9,0.35,0,51.6,0.13,49.5,0.13,
1.43,3.38,1.58,3.53,52,1.67,1.66,0.13,2.1,1.61)

## isoelec point
isoelec <-c(6,5.07,2.77,3.22,5.48,5.97,7.59,6.02,9.74,5.98,
5.74,5.41,6.3,5.65,10.76,5.68,6.16,5.96,5.89,5.66)

## 1st Hydrophobicity scale
H2Opho.34 <-c(1.8,2.5,-3.5,-3.5,2.8,-0.4,-3.2,4.5,-3.9,3.8,1.9,
-3.5,-1.6,-3.5,-4.5,-0.8,-0.7,4.2,-0.9,-1.3)

## 2nd Hydrophobicity scale
H2Opho.35 <-c(1.6,2,-9.2,-8.2,3.7,1,-3,3.1,-8.8,2.8,3.4,-4.8,
-0.2,-4.1,-12.3,0.6,1.2,2.6,1.9,-0.7)

## Surface area accessible to water in an unfolded peptide
saaH2O <-c(113,140,151,183,218,85,194,182,211,180,204,158,
143,189,241,122,146,160,259,229)

## Fraction of accessible area lost when a protein folds
faal.fold <-c(0.74,0.91,0.62,0.62,0.88,0.72,0.78,0.88,0.52,
0.85,0.85,0.63,0.64,0.62,0.64,0.66,0.7,0.86,0.85,0.76)

# Polar requirement
polar.req <-c(7,4.8,13,12.5,5,7.9,8.4,4.9,10.1,4.9,5.3,10,
6.6,8.6,9.1,7.5,6.6,5.6,5.2,5.4)

## relative frequency of occurance
## "The frequencies column shows the mean
### percentage of each amino acid in the protein sequences
### of modern organisms"
freq <-c(7.8,1.1,5.19,6.72,4.39,6.77,2.03,6.95,6.32,
10.15,2.28,4.37,4.26,3.45,5.23,6.46,5.12,7.01,1.09,3.3)

# charges
## un = Un-charged
## neg = negative
## pos = positive
charge<-c('un','un','neg','neg','un','un','pos','un','pos',
'un','un','un','un','un','pos','un','un','un','un','un')

## hydropathy
hydropathy<-c('hydrophobic','hydrophobic','hydrophilic','
hydrophilic','hydrophobic','neutral','neutral',
'hydrophobic','hydrophilic','hydrophobic','hydrophobic',
'hydrophilic','neutral',
'hydrophilic','hydrophilic','neutral','neutral',
'hydrophobic','hydrophobic','neutral')

## vol
vol.cat<-c('verysmall','small','small','medium',
'verylarge','verysmall','medium','large','large',
'large','large','small','small','medium','large',
'verysmall',
'small','medium','verylarge','verylarge')

## pol
pol.cat<-c('nonpolar','nonpolar','polar','polar',
'nonpolar','nonpolar','polar','nonpolar',
'polar','nonpolar','nonpolar','polar','nonpolar','polar',
'polar','polar','polar','nonpolar','nonpolar','polar')

## chemical
chemical<-c('aliphatic','sulfur','acidic','acidic','aromatic',
'aliphatic','basic','aliphatic','basic','aliphatic','sulfur',
'amide','aliphatic','amide','basic','hydroxyl','hydroxyl',
'aliphatic','aromatic','aromatic')

#Building data frame
aa_dat <- data.frame(aa,MW.da,vol,bulk, pol,isoelec, H2Opho.34, H2Opho.35, saaH2O, faal.fold, polar.req, freq, charge, hydropathy, vol.cat, pol.cat, chemical)

Raw Data Exploration

Exploring Data with Correlation/ Correlation Matrix

cor_ <- round(cor(aa_dat[,-c(1,13:17)]),2)
diag(cor_) <- NA
cor_[upper.tri(cor_)] <- NA
cor_

##           MW.da   vol  bulk   pol isoelec H2Opho.34 H2Opho.35 saaH2O faal.fold
## MW.da        NA    NA    NA    NA      NA        NA        NA     NA        NA
## vol        0.93    NA    NA    NA      NA        NA        NA     NA        NA
## bulk       0.55  0.73    NA    NA      NA        NA        NA     NA        NA
## pol        0.29  0.24 -0.20    NA      NA        NA        NA     NA        NA
## isoelec    0.20  0.36  0.08  0.27      NA        NA        NA     NA        NA
## H2Opho.34 -0.27 -0.08  0.44 -0.67   -0.20        NA        NA     NA        NA
## H2Opho.35 -0.25 -0.16  0.32 -0.85   -0.26      0.85        NA     NA        NA
## saaH2O     0.97  0.99  0.64  0.29    0.35     -0.18     -0.23     NA        NA
## faal.fold  0.11  0.18  0.49 -0.53   -0.18      0.84      0.79   0.12        NA
## polar.req -0.05 -0.19 -0.53  0.76   -0.11     -0.79     -0.87  -0.11     -0.81
## freq      -0.52 -0.30 -0.04 -0.01    0.02      0.26     -0.02  -0.38     -0.18
##           polar.req freq
## MW.da            NA   NA
## vol              NA   NA
## bulk             NA   NA
## pol              NA   NA
## isoelec          NA   NA
## H2Opho.34        NA   NA
## H2Opho.35        NA   NA
## saaH2O           NA   NA
## faal.fold        NA   NA
## polar.req        NA   NA
## freq           0.14   NA

The code below explores which columns have the most positive correlation and which have the most negative correlation between them.

SaaH2O and Vol have the most postive correlation. Polar.req and H2Ophobe.35 have the most negative

which(cor_ == max(cor_, na.rm = T), arr.ind = T)

##        row col
## saaH2O   8   2

max(cor_ , na.rm = T)

## [1] 0.99

which(cor_ == min(cor_, na.rm = T), arr.ind = T)

##           row col
## polar.req  10   7

min(cor_ , na.rm = T)

## [1] -0.87

Plotting the data

Plot 0: Scatterplot matrix

pairs(~MW.da + vol + pol + H2Opho.34 + polar.req + saaH2O + faal.fold, 
      data= aa_dat)

This function allows us to view the correlation between our varialbles with 0 being no correlation and 1 being highest correlation. It also makes the high correlation numbers bigger and makes the low correlation numbers smaller for easier visual. It also creates a line of best fit for the graphs

panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor,...) #5 arguments
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}

pairs(~MW.da + vol + pol + H2Opho.34 + polar.req + saaH2O + faal.fold, 
      data= aa_dat,
      lower.panel = panel.smooth ,
      upper.panel = panel.cor , 
      gap = 0
      )

We can conclude that there are strong correlations between: MW.da and vol, MW.da and saaH2O, Vol and saaH2O, each of them have a correlation of <90, there are other strong correlations (80-89). some weak correlations are: polar.req and MW.da, andsaaH2O and polar.req, H2Ophobe.34 and vol. Some examples of non-linear relationships are:Polar.req and faal.fold , pol and any other variable (non have linear relationship)

Plot 1: Replication Higgs and Attwood Figure 2.8

Figure 1

The plot showcases the relationship between the Volume and the Isoelectric point variables. this graph also indicates that acidic amino acids have a low Isoelectric point, while basic Amino Acids have a high Isoelectric point

plot(vol~isoelec, data= aa_dat, #this command plots the actual graph
     xlab = "Pl",
     ylab = "Volume",
     main = "Plot of Amino Acid Volume Against Pl",
     col = 0)
text(vol ~ isoelec, #this command is adding text labels as seen by labels = aa, aa is the 1 letter                            abbreviation of our amino acids
     labels = aa,
     data = aa_dat,
     col = 1:20)

Figure 2: HA Figure 2.8 with ggpubr

ggscatter(y = "vol",
          x = "isoelec",
          size = "MW.da",
          color = "polar.req",
          data = aa_dat,
          xlab = "Isoelectric Point",
          ylab = "Volume")

Figure 3: Highly correlated data

Plot 2 variables with positive correlation using ggscatter to plot volume and MW.da

N = B0 +B1T, and the ellipse represents the standard deviation of the 2 variables, the correlation coeffiecent tells us that they are very correlated, fianlly the line of best fit shows up the linearity of the variables

ggscatter(y = "vol",
          x = "saaH2O",
          add = "reg.line", #Line of best fit
          ellipse = TRUE, #Data ellipse
          cor.coef = TRUE, #Correlation coefficient
          data = aa_dat,
          xlab = "saaH2O",
          ylab = "Volume")

## `geom_smooth()` using formula 'y ~ x'

Figure 4: Non-linear relationship

creating a graph out of 2 non-linear variables, while also using a smoother to showcases that the relationship is not linear.

ggscatter(data = aa_dat,
          y = "vol",
          x = "H2Opho.34",
          size = "MW.da",   
          color = "pol",
          add = "loess") +
  xlab("H2O Phobia")+
  ylab("Volume")

## `geom_smooth()` using formula 'y ~ x'

Figure 5: Non-linear relationship on LOG scale

The log functions can transform non-linear relationships to linear one by putting them in a logarithmic scale, we are able to examine correlation better with this scale

aa_dat$log_vol<-log(vol)
aa_dat$log_H2Opho.34<-log(H2Opho.34)

## Warning in log(H2Opho.34): NaNs produced

ggscatter(data = aa_dat,
          y = "log_vol",
          x = "log_H2Opho.34",
          size = "bulk",   
          color = "freq",
          add = "reg.line") +
  xlab("log of H2O phobia") +
  ylab("log of Volume")

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 13 rows containing non-finite values (stat_smooth).

## Warning: Removed 13 rows containing missing values (geom_point).

Figure 6: 3D Scatterplot

The 3D graph does not seem to show correlation between the 3 different variables since they all lie in different areas of the graph. a strong correlation would be if the majority of their datas are grouped in the same spot

scatterplot3d(x = aa_dat$saaH2O,
                     y = aa_dat$MW.da,
                     z = aa_dat$svol,
              highlight.3d = TRUE,
              angle = 100,
              type="h")

Principal Component Analysis

Principal Component Analysis, shortened to PCA, is a analysis method used commonly by computational biology researchers. It’s basic idea is plotting multileveled data (such as 8 variables) and scaling them down to a 2D graph which showcases related variables clustered together. Higgs Atwood, who is a renowned computational biology professor, wrote in his book “The PCA method transforms this cloud of points first by scaling them and shifting them to the origin, and then by rotating them in such a way that the points are spread out as much as possible, and the structure in the data is made easier to see.” We have two ways of utilizing PCA in R. Either through the base PCA function in R, or through the vegan package which contains the function rda().

Data prep

In our Dataset, we have 11 variables that contain properties of amino acids, including non numeric ones. our first step is scaling back this large dataset to 8 variables which are mentioned in Higgs and Attwood data table. coincidentally only numeric data is in the table

aa_dat2 = data.frame(vol, bulk, pol, isoelec,H2Opho.34, H2Opho.35, saaH2O, faal.fold)

Figure 7: Principal Component Analysis- Base R

Run a PCA and create a PCA plot using base R command.

pca.out <- prcomp(aa_dat2[,-c(7,8)], scale = TRUE)

biplot(pca.out)

Figure 8: Principal component analysis - vegan

Run the PCA analysis through vegan

rda.out <- vegan::rda(aa_dat2, scale = TRUE) #eigenvalues to display the importance of axes and to describe the true configuration of points. that means they change by a scaler value

Extract the known PCA score

rda_scores <- scores(rda.out)

Make a “biplot” (I usually call this a PCA plot

biplot(rda.out, display = "sites", col = 0,)
orditorp(rda.out, display = "sites", labels = aa_dat$aa, cex = 1.2)
ordihull(rda.out,
group = aa_dat$hydropathy,
col = 1:7,
lty = 1:7,
lwd = c(3,6))

The figure in Higgs Attwood (figure 2.10) is different than our figure for example our component numbers are different, and using vegan, we highlighted the most likely grouping while in Higgs plot no highlighting is made.

Cluster Analsysis

Cluster analysis allows to create meaningful subgroups of large data sets, these groups are highly correlated together and have similar properties.

Hierarchical clustering method

The UPGMA algorithm constructs a rooted tree that reflects the structure present in a pairwise dismimilarty matrix. At each step, the nearest two clusters are combined into a higher-level cluster. The distance between any two clusters is taken to the mean distance of all elements in each cluster that is made

Adding the row names

row.names(aa_dat2) <- paste(aa_dat$aa, sep = "")
row.names(aa_dat2)

##  [1] "A" "C" "D" "E" "F" "G" "H" "I" "K" "L" "M" "N" "P" "Q" "R" "S" "T" "V" "W"
## [20] "Y"

A distance matrix creates a two dimensional mattrix, in which distance represents the similarities or differences between each point, with similarities being a high number on a 0 to 1 scale and differences being vice-versa

Euclidean is a geometric analysis which seeks to understand the geometry of flat, two-dimensional spaces in which any 2 points can be joined by a straight line. it follows Euclid postulates

dist_euc <- dist(aa_dat2[,-9], 
                 method = "euclidean")

A cluster analysis can be done with hclust() function with the UPGMA algorithm.

clust_euc <- hclust(dist_euc, method = "average") #the method "average" performs UPGMA cluster analysis

Plot the dendrogram

par(mar = c(1,1,1,1))
dendro_euc <- as.dendrogram(clust_euc)
plot(dendro_euc,horiz = T,nodePar = list(pch = c(1,NA), 
                                          cex = 0.7, 
                                          lab.cex = 1))

Replicating Higgs and Attwood’s analysis on the properties of amino acids

Paul Sharoubeem, with contributions from Nathan Brouwer

11/16/2020