Introduction

The script below attempts to recreate the research of Higgs and Attwood about amino acids. Various scatterplots and dendrograms are created in order to understand different relationships between the variables. This is important to study because understanding amino acids, gives us insight to how proteins and other key biological functions take place. Additionally, learning about the relationship between different amino acids can help further our understanding of diseases, mutations and other alignments to the human body.

Preliminaries

Packages

ggpubr makes creating graphs and plots with ggplot2 easier. The package contains functions that allow the programmer to easily customize plots created by using ggplot2.

## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.5-6

Build the dataframe

These data are derived mostly from Higgs (2009) Table 1. They are similar to Higgs and Attwood 2005 but have extra columns. Nathan Brouwer added additional categorical varibles based on online information.

The data represents all necessary information about all the 20 amino acids.

Make the vectors.

# 1 letter code
aa        <-c('A','C','D','E','F','G','H','I','K','L','M','N',
              'P','Q','R','S','T','V','W','Y')


## molecular weight in dalton
MW.da     <-c(89,121,133,146,165,75,155,131,146,131,149,132,115,147,174,105,119,117,204,181)

## vol from van der Waals radii 
vol    <-c(67,86,91,109,135,48,118,124,135,124,124,96,90,
              114,148,73,93,105,163,141)

##  bulk – a measure of the shape of the side chain
bulk <-c(11.5,13.46,11.68,13.57,19.8,3.4,13.69,21.4,
              15.71,21.4,16.25,12.28,17.43,
              14.45,14.28,9.47,15.77,21.57,21.67,18.03)

## pol – a measure of the electric field strength around the molecule
pol  <-c(0,1.48,49.7,49.9,0.35,0,51.6,0.13,49.5,0.13,
              1.43,3.38,1.58,3.53,52,1.67,1.66,0.13,2.1,1.61)

## isoelec point
isoelec <-c(6,5.07,2.77,3.22,5.48,5.97,7.59,6.02,9.74,5.98,
                   5.74,5.41,6.3,5.65,10.76,5.68,6.16,5.96,5.89,5.66)

## 1st Hydrophobicity scale
H2Opho.34 <-c(1.8,2.5,-3.5,-3.5,2.8,-0.4,-3.2,4.5,-3.9,3.8,1.9,
                  -3.5,-1.6,-3.5,-4.5,-0.8,-0.7,4.2,-0.9,-1.3)

## 2nd Hydrophobicity scale
H2Opho.35 <-c(1.6,2,-9.2,-8.2,3.7,1,-3,3.1,-8.8,2.8,3.4,-4.8,
                  -0.2,-4.1,-12.3,0.6,1.2,2.6,1.9,-0.7)

## Surface area accessible to water in an unfolded peptide
saaH2O       <-c(113,140,151,183,218,85,194,182,211,180,204,158,
                 143,189,241,122,146,160,259,229)


## Fraction of accessible area lost when a protein folds
faal.fold    <-c(0.74,0.91,0.62,0.62,0.88,0.72,0.78,0.88,0.52,
                 0.85,0.85,0.63,0.64,0.62,0.64,0.66,0.7,0.86,0.85,0.76)



# Polar requirement 
polar.req    <-c(7,4.8,13,12.5,5,7.9,8.4,4.9,10.1,4.9,5.3,10,
                 6.6,8.6,9.1,7.5,6.6,5.6,5.2,5.4)

## relative frequency of occurance
## "The frequencies column shows the mean 
### percentage of each amino acid in the protein sequences 
### of modern organisms"
freq        <-c(7.8,1.1,5.19,6.72,4.39,6.77,2.03,6.95,6.32,
                10.15,2.28,4.37,4.26,3.45,5.23,6.46,5.12,7.01,1.09,3.3)

##
# charges
## un = Un-charged
## neg = negative
## pos = positive
charge<-c('un','un','neg','neg','un','un','pos','un','pos',
          'un','un','un','un','un','pos','un','un','un','un','un')

## hydropathy
hydropathy<-c('hydrophobic','hydrophobic','hydrophilic','
              hydrophilic','hydrophobic','neutral','neutral',
              'hydrophobic','hydrophilic','hydrophobic','hydrophobic',
              'hydrophilic','neutral',
              'hydrophilic','hydrophilic','neutral','neutral',
              'hydrophobic','hydrophobic','neutral')

## vol
vol.cat<-c('verysmall','small','small','medium',
              'verylarge','verysmall','medium','large','large',
              'large','large','small','small','medium','large',
              'verysmall',
              'small','medium','verylarge','verylarge')

## pol
pol.cat<-c('nonpolar','nonpolar','polar','polar',
                'nonpolar','nonpolar','polar','nonpolar',
                'polar','nonpolar','nonpolar','polar','nonpolar','polar',
                'polar','polar','polar','nonpolar','nonpolar','polar')

## chemical
chemical<-c('aliphatic','sulfur','acidic','acidic','aromatic',
            'aliphatic','basic','aliphatic','basic','aliphatic','sulfur',
            'amide','aliphatic','amide','basic','hydroxyl','hydroxyl',
            'aliphatic','aromatic','aromatic')

Build the dataframes.

### Build dataframe
aa_dat <- data.frame(aa, 
                     MW.da,vol,
                     bulk, pol, 
                     isoelec,H2Opho.34, H2Opho.35,
                     saaH2O, faal.fold, polar.req,
                     freq, charge, hydropathy,
                     vol.cat, pol.cat, chemical)

Raw data exploration

Maxtrix

TODO: Examine the code below. What is it called? Replace the XXXXXXXX above with the name of this thing

cor_ <- round(cor(aa_dat[,-c(1,13:17)]),2)
diag(cor_) <- NA
cor_[upper.tri(cor_)] <- NA
cor_
##           MW.da   vol  bulk   pol isoelec H2Opho.34 H2Opho.35 saaH2O faal.fold
## MW.da        NA    NA    NA    NA      NA        NA        NA     NA        NA
## vol        0.93    NA    NA    NA      NA        NA        NA     NA        NA
## bulk       0.55  0.73    NA    NA      NA        NA        NA     NA        NA
## pol        0.29  0.24 -0.20    NA      NA        NA        NA     NA        NA
## isoelec    0.20  0.36  0.08  0.27      NA        NA        NA     NA        NA
## H2Opho.34 -0.27 -0.08  0.44 -0.67   -0.20        NA        NA     NA        NA
## H2Opho.35 -0.25 -0.16  0.32 -0.85   -0.26      0.85        NA     NA        NA
## saaH2O     0.97  0.99  0.64  0.29    0.35     -0.18     -0.23     NA        NA
## faal.fold  0.11  0.18  0.49 -0.53   -0.18      0.84      0.79   0.12        NA
## polar.req -0.05 -0.19 -0.53  0.76   -0.11     -0.79     -0.87  -0.11     -0.81
## freq      -0.52 -0.30 -0.04 -0.01    0.02      0.26     -0.02  -0.38     -0.18
##           polar.req freq
## MW.da            NA   NA
## vol              NA   NA
## bulk             NA   NA
## pol              NA   NA
## isoelec          NA   NA
## H2Opho.34        NA   NA
## H2Opho.35        NA   NA
## saaH2O           NA   NA
## faal.fold        NA   NA
## polar.req        NA   NA
## freq           0.14   NA

The code below returns the index of the max value in cor_. The variable that has the most positive correlation is surface area accessible and H2O. The variable with the most negative correlation is the polar requirement and H2Opho.35.

which(cor_ == max(cor_, na.rm = T), arr.ind = T)
##        row col
## saaH2O   8   2
which(cor_ == min(cor_, na.rm = T), arr.ind = T)
##           row col
## polar.req  10   7

Plot 0: Scatterplot matrix

the code below uses a function to create a panel for the scatter plot matrix that compares the variables in cor_. The correlation is outputted in the panel and changing the size of the text based on the correlation.

panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor,...)
{
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(0, 1, 0, 1))
    r <- abs(cor(x, y))
    txt <- format(c(r, 0.123456789), digits = digits)[1]
    txt <- paste0(prefix, txt)
    if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
    text(0.5, 0.5, txt, cex = cex.cor * r)
}

Relationship between the 6 variables: The strongest correlations are between (MW.da and vol) The weakest correlations are between (vol and H2Opho.34). (MW.da and pol) show a non-linear relationship.

plot(aa_dat[c("MW.da", "vol", "pol", "H2Opho.34", "polar.req", "saaH2O", "faal.fold")], panel = panel.smooth, upper.panel = panel.cor, main = "Example")

Plot 1: Replication of Higgs and Attwood Figure 2.8

This plot represents the amino acids in the data set comparing their isoelectic point vs. their volume.

plot(vol ~ isoelec, data = aa_dat, 
     xlab = "pI", ylab = "Volume(Angstrom)", main = "Amino Acids volume vs. pI",
     col = 0)
text (vol ~ isoelec, labels = aa, data = aa_dat, col = 1:20)

Figure 2: HA Figure 2.8 with ggpubr

ggscatter(y = "polar.req",
x = "freq",
size = "polar.req",
color = "polar.req",
data = aa_dat,
xlab = "pl",
ylab = "Volume (Angstroms)")

Figure 3: Highly correlated data

The regression equation is in the form of y = B0 + B1*x. B0 represents the y-intercept while B1 represents the slope of the regression line. The data ellipse shows the set of points that are within interval from the regression line.

ggscatter(y = "MW.da",
x = "vol",
size = "MW.da",
add = "reg.line", # line of best fit
ellipse = TRUE, # data ellipse
cor.coef = TRUE, # correlation coef
data = aa_dat,
xlab = "Volume",
ylab = "Molecular Weight (dalton)")
## `geom_smooth()` using formula 'y ~ x'

Figure 4: Non-linear relationship

Creating a scatter plot with a non-linear relationship

ggscatter(y = "MW.da",
x = "polar.req",
size = "MW.da",
color = "bulk",
data = aa_dat,
xlab = "Polar Requirement",
ylab = "Molecular Weight (dalton)") + geom_smooth(method = "loess")
## `geom_smooth()` using formula 'y ~ x'

Figure 5: Non-linear relationship on LOG scale

The log transformation allows non-linear data to be converted into somewhat linear data.

aa_dat$MW.da_log<- log(aa_dat$MW.da)
aa_dat$polar.req_log <- log(aa_dat$polar.req)
ggscatter(y = "MW.da_log",
x = "polar.req_log",
size = "MW.da",
color = "bulk",
add = "reg.line",
data = aa_dat,
xlab = "Polar Requirement",
ylab = "Molecular Weight (dalton)") 
## `geom_smooth()` using formula 'y ~ x'

Figure 6: 3D Scatterplot

All three variables have a highly positive correlation. Molecular weight and surface area molecular weight and volume, and volume and surface area have a high positive correlation. The highest correlation is volume and surface area.

scatterplot3d(x = aa_dat$MW.da,
              y = aa_dat$vol,
              z = aa_dat$saaH2O,
              highlight.3d = TRUE,
              type = "h")

Principal component analysis

PCA allows us to manage large datasets by reducing the number of variables. This is especially useful to be able to visualize the relationship between fewer variables, allowing us to draw more accurate conclusions. “The principal components plot also shows that there is no particular similarity between amino acids in the same row” (Higgs and Attwood 2005, pg 4).

In this class, we created PCA plots in 2 ways: basic R PCA function and by using the vegan package. The basic R PCA function is not very flexible. With the vegan package, we used the rda() function to create the plot.

Data Prep

To create this data frame, I dropped the first two columns as well as columns 11-19 since those are the variables that are categorical.

aa_dat2 <- aa_dat[-c(1:2, 11:19)]

Figure 7: Principal components analysis - base R

pca.out <- prcomp(aa_dat2, scale = TRUE)
biplot(pca.out)

Figure 8: Principal components Analysis - Vegan

scale = TRUE makes the PCA to consider correlations.

rda.out <- vegan :: rda(aa_dat2, scale = TRUE)
rda_scores <- scores(rda.out)
rda_scores
## $species
##                  PC1         PC2
## vol        0.1292455 -1.21867402
## bulk      -0.5240422 -1.00021659
## pol        1.0392368 -0.22161142
## isoelec    0.4272346 -0.51780748
## H2Opho.34 -1.1576058 -0.06083202
## H2Opho.35 -1.1909235  0.07563303
## saaH2O     0.2293130 -1.17930581
## faal.fold -1.0664891 -0.35159129
## 
## $sites
##                PC1         PC2
## sit1  -0.431697708  1.01348876
## sit2  -0.812161280  0.50497652
## sit3   0.958626669  0.71357979
## sit4   0.941582653  0.22249037
## sit5  -0.867271594 -0.76051369
## sit6  -0.093468837  1.79557675
## sit7   0.643286377 -0.33344292
## sit8  -1.013877712 -0.56001005
## sit9   1.450808627 -0.67736662
## sit10 -0.904499352 -0.52173690
## sit11 -0.668686375 -0.38778285
## sit12  0.462264010  0.53066191
## sit13  0.007416261  0.36866189
## sit14  0.458970843  0.05206945
## sit15  1.580353509 -1.08451465
## sit16 -0.014704047  1.08414678
## sit17 -0.184996200  0.37158066
## sit18 -0.979015329 -0.23223164
## sit19 -0.420924604 -1.37484059
## sit20 -0.112005910 -0.72479295
## 
## attr(,"const")
## [1] 3.511243

This plot created from the code below differs from Figure 2.10 in Higgs and Attwood because the points and scale are differnet as well the grouping. In my graph I chose to group the points by polarity. Additionally, my plot has 2 distinct clusters=, while Figure 2.10 has 4.

biplot(rda.out, display = "sites", col = 0,)
orditorp(rda.out, display = "sites", labels = aa_dat$aa, cex = 1.2)
ordihull(rda.out,
group = aa_dat$pol.cat,
col = 1:7,
lty = 1:7,
lwd = c(3,6))

Cluster Analysis

Cluster analysis allows us to understand large data sets by grouping related points into clusters. This allows us to visualize different relationships between the data that may not be obvious just by looking at the data.

UPGMA uses a pair-wise distance matrix and compares the species, constructing a tree from the smallest distance between the groups of species. It produces a rooted tree.

Hierarchial Cluster Analysis

Adding row names:

row.names(aa_dat2) <- paste(aa_dat$aa, sep = ".")

Euclidean Distance: a measurement of the distance between two points using the pythagorean theorem.

dist_euc <- dist(aa_dat2, method = "euclidean")

creating UPGMA

cluster_euc <- hclust(dist_euc, method = "average") # setting to UPGMA

plotting resulting dendrogram

plot(cluster_euc,hang = -1, cex = 0.5, labels = aa)

There are a few key differences between this dendrogram and Higgs and Attwood’s. Most of the clades are different, which could be due to a different clustering algorithm or could be due to different variables being used.