Introduction

In this script, we are showing that there are several similarities and differences between amino acids and their properties. We build a data frame then create similarity matrices, scatterplots, and PCAs which allow for a further analysis of the similarities between amino acids. This is important biologically because it allows us to see how amino acids differ from one another and which amino acids are more genetically similar compared to others. Further this will help us in biology because then we can determine that if amino acids change in specific scenarios, then we can better predict what those changes may effect.

Preliminaries

Packages

## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.5-6

Build the dataframe

These data are derived mostly from Higgs (2009) Table 1. They are similar to Higgs and Attwood 2005 but have extra columns. Nathan Brouwer added additional categorical varibles based on online information.

This data represents the amino acids and their properties.

Make the vectors.

# 1 letter code
aa        <-c('A','C','D','E','F','G','H','I','K','L','M','N',
              'P','Q','R','S','T','V','W','Y')


## molecular weight in dalton
MW.da     <-c(89,121,133,146,165,75,155,131,146,131,149,132,115,147,174,105,119,117,204,181)

## vol from van der Waals radii 
vol    <-c(67,86,91,109,135,48,118,124,135,124,124,96,90,
              114,148,73,93,105,163,141)

##  bulk – a measure of the shape of the side chain
bulk <-c(11.5,13.46,11.68,13.57,19.8,3.4,13.69,21.4,
              15.71,21.4,16.25,12.28,17.43,
              14.45,14.28,9.47,15.77,21.57,21.67,18.03)

## pol – a measure of the electric field strength around the molecule
pol  <-c(0,1.48,49.7,49.9,0.35,0,51.6,0.13,49.5,0.13,
              1.43,3.38,1.58,3.53,52,1.67,1.66,0.13,2.1,1.61)

## isoelec point
isoelec <-c(6,5.07,2.77,3.22,5.48,5.97,7.59,6.02,9.74,5.98,
                   5.74,5.41,6.3,5.65,10.76,5.68,6.16,5.96,5.89,5.66)

## 1st Hydrophobicity scale
H2Opho.34 <-c(1.8,2.5,-3.5,-3.5,2.8,-0.4,-3.2,4.5,-3.9,3.8,1.9,
                  -3.5,-1.6,-3.5,-4.5,-0.8,-0.7,4.2,-0.9,-1.3)

## 2nd Hydrophobicity scale
H2Opho.35 <-c(1.6,2,-9.2,-8.2,3.7,1,-3,3.1,-8.8,2.8,3.4,-4.8,
                  -0.2,-4.1,-12.3,0.6,1.2,2.6,1.9,-0.7)

## Surface area accessible to water in an unfolded peptide
saaH2O       <-c(113,140,151,183,218,85,194,182,211,180,204,158,
                 143,189,241,122,146,160,259,229)


## Fraction of accessible area lost when a protein folds
faal.fold    <-c(0.74,0.91,0.62,0.62,0.88,0.72,0.78,0.88,0.52,
                 0.85,0.85,0.63,0.64,0.62,0.64,0.66,0.7,0.86,0.85,0.76)



# Polar requirement 
polar.req    <-c(7,4.8,13,12.5,5,7.9,8.4,4.9,10.1,4.9,5.3,10,
                 6.6,8.6,9.1,7.5,6.6,5.6,5.2,5.4)

## relative frequency of occurance
## "The frequencies column shows the mean 
### percentage of each amino acid in the protein sequences 
### of modern organisms"
freq        <-c(7.8,1.1,5.19,6.72,4.39,6.77,2.03,6.95,6.32,
                10.15,2.28,4.37,4.26,3.45,5.23,6.46,5.12,7.01,1.09,3.3)

##
# charges
## un = Un-charged
## neg = negative
## pos = positive
charge<-c('un','un','neg','neg','un','un','pos','un','pos',
          'un','un','un','un','un','pos','un','un','un','un','un')

## hydropathy
hydropathy<-c('hydrophobic','hydrophobic','hydrophilic','
              hydrophilic','hydrophobic','neutral','neutral',
              'hydrophobic','hydrophilic','hydrophobic','hydrophobic',
              'hydrophilic','neutral',
              'hydrophilic','hydrophilic','neutral','neutral',
              'hydrophobic','hydrophobic','neutral')

## vol
vol.cat<-c('verysmall','small','small','medium',
              'verylarge','verysmall','medium','large','large',
              'large','large','small','small','medium','large',
              'verysmall',
              'small','medium','verylarge','verylarge')

## pol
pol.cat<-c('nonpolar','nonpolar','polar','polar',
                'nonpolar','nonpolar','polar','nonpolar',
                'polar','nonpolar','nonpolar','polar','nonpolar','polar',
                'polar','polar','polar','nonpolar','nonpolar','polar')

## chemical
chemical<-c('aliphatic','sulfur','acidic','acidic','aromatic',
            'aliphatic','basic','aliphatic','basic','aliphatic','sulfur',
            'amide','aliphatic','amide','basic','hydroxyl','hydroxyl',
            'aliphatic','aromatic','aromatic')

Build the dataframes.

### Build dataframe
aa_dat <- data.frame(aa, 
                     MW.da,vol,
                     bulk, pol, 
                     isoelec,H2Opho.34, H2Opho.35,
                     saaH2O, faal.fold, polar.req,
                     freq, charge, hydropathy,
                     vol.cat, pol.cat, chemical)

Raw data exploration

Similarity Matrix

Finds the Correlation - similarity and dissimilarity - between the different variables in the data frame.

cor_ <- round(cor(aa_dat[,-c(1,13:17)]),2)
diag(cor_) <- NA
cor_[upper.tri(cor_)] <- NA
cor_
##           MW.da   vol  bulk   pol isoelec H2Opho.34 H2Opho.35 saaH2O faal.fold
## MW.da        NA    NA    NA    NA      NA        NA        NA     NA        NA
## vol        0.93    NA    NA    NA      NA        NA        NA     NA        NA
## bulk       0.55  0.73    NA    NA      NA        NA        NA     NA        NA
## pol        0.29  0.24 -0.20    NA      NA        NA        NA     NA        NA
## isoelec    0.20  0.36  0.08  0.27      NA        NA        NA     NA        NA
## H2Opho.34 -0.27 -0.08  0.44 -0.67   -0.20        NA        NA     NA        NA
## H2Opho.35 -0.25 -0.16  0.32 -0.85   -0.26      0.85        NA     NA        NA
## saaH2O     0.97  0.99  0.64  0.29    0.35     -0.18     -0.23     NA        NA
## faal.fold  0.11  0.18  0.49 -0.53   -0.18      0.84      0.79   0.12        NA
## polar.req -0.05 -0.19 -0.53  0.76   -0.11     -0.79     -0.87  -0.11     -0.81
## freq      -0.52 -0.30 -0.04 -0.01    0.02      0.26     -0.02  -0.38     -0.18
##           polar.req freq
## MW.da            NA   NA
## vol              NA   NA
## bulk             NA   NA
## pol              NA   NA
## isoelec          NA   NA
## H2Opho.34        NA   NA
## H2Opho.35        NA   NA
## saaH2O           NA   NA
## faal.fold        NA   NA
## polar.req        NA   NA
## freq           0.14   NA

Which variables have the most postive correlation? What have most negative correlation?

The code below is determining which of the variables in the similarity matrix have the most positive correlation and the rows in which it is. The most positive correlation is the number which is closest to 1 (0.99) and the most negative correlation is the number which is closest to -1 (-0.87). The variables which have the most positive correlation is saaH20 and MW.da and the most negative correlation is polar.req and H2Opho.35.

which(cor_ == max(cor_, na.rm = T), arr.ind = T)
##        row col
## saaH2O   8   2
which(cor_ == min(cor_, na.rm = T), arr.ind = T)
##           row col
## polar.req  10   7
print(max(cor_, na.rm = T))
## [1] 0.99
print(min(cor_, na.rm = T))
## [1] -0.87

Plot 0: Scatterplot matrix

The code below takes the six variables “MW.da”,“vol”,“pol”,“H2Opho.34”,“polar.req”, “saaH2O”, “faal.fold” and put them into a scatter plot matrix showing the scatterplot on the bottom left triangle and the upper right triangle has their correlations.

panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor,...)
{
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(0, 1, 0, 1))
    r <- abs(cor(x, y))
    txt <- format(c(r, 0.123456789), digits = digits)[1]
    txt <- paste0(prefix, txt)
    if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
    text(0.5, 0.5, txt, cex = cex.cor * r)
}

In the created scatter plot matrix the saaH20 and the vol have the greatest, most positive correlation. Polar.req and MW.da have the lowest correlation of 0.05. There are several that do not have linear relationships such as MW.da - pol and polar.req - MW.da.

plot(aa_dat[, c("MW.da","vol","pol","H2Opho.34","polar.req", "saaH2O", "faal.fold")],
     panel = panel.smooth,
      upper.panel =panel.cor ,
     main = "ScatterPlot Matrix")

Plot 1: Replication Higgs and Attwood Figure 2.8

This plot looks at the amino acid properties of the polar requirement and the frequency of occurrences. The letters on the graph are the indications for which amino acid it is talking about.

Figure 2: HA Figure 2.8 with ggpubr

This is taking the same variables used above but making it into a scatter plot with ggpubr. With using color and size to indicate two additional variables including the ones being used.

ggscatter(y = "polar.req",
          x = "freq",
          size = "saaH2O",
          color = "bulk",
          data = aa_dat,
          xlab = "freq",
          ylab = "polar.req")

Figure 3: Highly correlated data

This graph is creating a scatterplot using ggpubr but using two variables which are highly correlated. In this case I am using saaH2O and vol because they have a R value of 0.99. The regression line is created by finding the correlation then wherever the line will be most in the middle of all the data points is where the regression line will be. The data ellipse is used to indicate how close of a correlation the variables have; such as, in this case the ellipse is very thin because of a very high correlation, but if there was not that much of a correlation there would be a much thicker, rounder ellipse circling the data points.

ggscatter(y = "saaH2O",
          x = "vol",
          size = "saaH2O",
          add  = "reg.line", #line of best fit
          ellipse = TRUE, # data ellipse
          cor.coef = TRUE,#correlation coefficient
          data = aa_dat,
          xlab = "vol",
          ylab = "saaH2O")
## `geom_smooth()` using formula 'y ~ x'

Figure 4: Non-linear relationship

In this graph, the two variables used are polar.req and vol because of their non-liner relationship. I used a loess smoother which is function of ggpubr which allows for a better prediction of the line of best fit. I also used the variables freq and saaH2O for the color and size analysis to add 2 other numeric variables.

ggscatter(y = "polar.req",
          x = "vol",
          data = aa_dat,
          size = "freq",
          color = "saaH2O",
          add = "loess", #adds smoother through ggpubr
          xlab = "vol",
          ylab = "polar.req")
## `geom_smooth()` using formula 'y ~ x'

Figure 5: Non-linear relationship on LOG scale

This graph uses the same variables as the previous graph but changes them to use a log-based scale rather than their regular scale. Using the log based scale allows for the graph to use smaller values that are closer together rather than using bigger values which make the graph harder to interpret.

aa_dat$polar.req_log <- log(aa_dat$polar.req) #creating a new column with polar.req_log
aa_dat$vol_log <- log(aa_dat$vol) #creating a new column with vol_log

ggscatter(y = "polar.req_log",
          x = "vol_log",
          data = aa_dat,
          size = "freq",
          color = "freq",
          add = "loess", #adds smoother through ggpubr
          xlab = "vol_log",
          ylab = "polar.req_log",
          )
## `geom_smooth()` using formula 'y ~ x'

Figure 6: 3D Scatterplot

The three variables I chose for this 3D scatterplot are MW.da, vol, an saaH2O. I chose these variables because they have strong correlations with each other. The correlation between MW.da and vol is 0.93, MW.da and saaH2O is 0.97, and vol and saaH2O is 0.99.

scatterplot3d(x = aa_dat$MW.da,
              y = aa_dat$vol,
              z = aa_dat$saaH2O,
              highlight.3d = TRUE,
              type = "h")

Principal Component Analysis

Subsetting Dataframe manually by using the names of the column.

aa_dat2 <- aa_dat[,c("vol", "bulk", "pol", "isoelec", "H2Opho.34", "H2Opho.35",
                     "saaH2O", "faal.fold")]

Figure 7: Principal components analysis - base R

“The PCA method transforms this cloud of points first by scaling them and shifting them to the origin, and then by rotating them in such a way that the points are spread out as much as possible, and the structure in the data is made easier to see” (Higgs and Attwood 2005, pg 26). In this class we use PCA as a dimension reduction method. The two basic ways we create a PCA plot in this class is using base R and by using vegan. The PCA using vegan is sometimes easier to comprehend because it is easier to look at graphically and is not as “clunky” as PCA using R.

Running of the PCA and creating a PCA plot using base R.

pca.out <- prcomp(aa_dat2, scale = TRUE)

Plotting the output of the PCA.

biplot(pca.out)

Figure 8: Principal components analysis - vegan

Running a PCA using vegan instead of base R.

rda.out <- vegan::rda(aa_dat2,
                      scale = TRUE) # scale makes the scale more concise uses smaller                                           intervals. If you were to put scale = False the                                           scale would be from -10 - 10 rather than -3 - 3 from                                       the x axis.  
rda_scores <- scores(rda.out) #extract PCA scores

biplot(rda.out, display = "sites", col = 0,)
orditorp(rda.out, display = "sites", labels = aa_dat$aa, cex = 1.2)
ordihull(rda.out,
        group = aa_dat$hydropathy,
        col = 1:7,
        lty = 1:7,
        lwd = c(3,6))

The above plot is the plot from Higgs and Attwood using hydropathy and the below plot uses pol.cat. The main differences between the two plots are that the plot from Higgs uses 3 categories [hydrophobic, hydrophilic, and neutral] and the variable I used only have 2 categories [polar and nonpolar]; therefore, in the plots the Higgs plot has 3 circles and my plot have only 2 circles correlating to the number of categories.

biplot(rda.out, display = "sites", col = 0,)
orditorp(rda.out, display = "sites", labels = aa_dat$aa, cex = 1.2)
ordihull(rda.out,
        group = aa_dat$pol.cat,
        col = 1:7,
        lty = 1:7,
        lwd = c(3,6))

Cluster Analysis

Hierarchial Cluster Analysis (UPGMA)

The purpose of a cluster analysis is to analyze the similarities between two amino acids. It will show which amino acids are closely related compared to ones that are further apart allowing to “cluster” and “group” the variables.

The distance matrix shows the pairwise distances between two structures. It shows which variables are most similar and which are not. The euclidean distance is used because it determines the distance between two “cells.” It allows for a universal way of calculating the distances between two objects in the distance matrix.

The cluster dendrogram created here is a little different from the one created in the the Higgs paper because the amino acids are rearranged a little differently and in the Higgs articles, they have a better image on what their dendrogram is meaning. The dendrogram created here is similar to the one created by Brouwer in that they have the same rearrangement of amino acids, but mine is not oriented the same as it.

row.names(aa_dat2) <- paste(aa_dat2$Species,1:nrow(aa_dat2),sep = ".")

dist_euc <- dist(aa_dat2, method = "euclidean")
clust_euc <- hclust(dist_euc, method = "average") # method =    "average" allows for                                                     the UPGMA
plot(clust_euc, labels = aa_dat$aa)